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In the uniqueness part of a geophysical inverse problem, the 
observer wants to predict all likely values of P unknown numerical 
properties z=(zi,...,z/>) of the earth from measurement of D other 
numerical properties y (0) =(y i (0) , using full or partial 
knowledge of the statistical distribution of the random errors in y (0) . 
The data space Y containing y (0) is D -dimensional, so when the 
model space X is infinite-dimensional the linear uniqueness prob- 
lem usually is insoluble without prior information about the correct 
earth model x. If that information is a quadratic bound on x (e.g., 
energy or dissipation rate), Bayesian inference (BI) and stochastic 
inversion (SI) inject spurious structure into x, implied by neither 
the data nor the quadratic bound. Confidence set inference (CSI) 
provides an alternative inversion technique free of this objection. 
The first step in CSI is to estimate unmodelled systematic errors in 
y (0) and z. The second step is to choose any finite-dimensional sub- 
space X N of X , and to use the prior quadratic bound to estimate the 
truncation error when the full data function F :X —>Y in the for- 
ward problem is approximated by restricting it to X N to give a 
finite-dimensional function F N : X N -» Y . Step three calculates the 
eigenstructure (singular value decomposition) of F N . Step 4 uses 
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this eigenstructure to find for each positive p < la Neyman subset 
K z (p) of the P -dimensional prediction space Z such that either the 
correct value of the prediction vector z is a member of the 
confidence set K z (p ) or an event has occurred whose probability 
was no more thanp. In contrast to SI and BI, CSI offers no incen- 
tive for considering any value of P except 1. CSI is illustrated in 
the problem of estimating the geomagnetic field B at the core- 
mantle boundary (CMB) from components of B measured on or 
above the earth’s surface. Neither the heat flow nor the energy 
bound is strong enough to permit estimation of B r at single points 
on the CMB, but the heat flow bound permits estimation of uniform 
averages of B r over discs on the CMB, and both bounds permit 
weighted disc-averages with continous weighting kernels. Both 
bounds also permit estimation of low-degree Gauss coefficients at 
the CMB. The heat flow bound resolves them up to degree 8 if the 
crustal field at satellite altitudes must be treated as a systematic 
error, but can resolve to degree 1 1 under the most favorable statisti- 
cal treatment of the crust. These two limits produce circles of con- 
fusion on the CMB with diameters of 25° and 19° respectively. 

Key words: confidence sets, inverse problems, core-mantle boun- 
dary field 
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1 INTRODUCTION 

In geophysical inverse problems, the data are finitely many real numbers y { 0> y& 0) measured 

with unknown observational errors 8y Syp°\ The goal is to use the data to estimate proper- 
ties of the real earth. A common intermediate step is the existence problem, to find one physi- 
cally reasonable earth model x (0) which fits the data acceptably (i.e., within the random errors of 
observation and the systematic errors of modelling). Usually the model space X of all possible 
earth models x is infinite dimensional, so many physically reasonable models besides x (0) will fit 
the data acceptably. Thus besides the existence problem there is a uniqueness problem, to esti- 
mate the "error" in x (0) , i.e., to discover how much the real earth might deviate from x (0) . 

The uniqueness problem must be carefully formulated if it is to be amenable to quantitative 
solution. Collect the data into an observed data vector y (0) = (y , .... y^ 0) ), and let Y be the data 
space, the vector space (linear space) of all possible D -dimensional data vectors. The theory of 
the forward problem provides a function F which assigns to each earth model x in X the value 
F (x) which y (0) would take if x were the correct earth model and there were no errors. The errors 
in y r0) are of two types: the random error of observation, < 5 y = (6y x , .... 5y D )\ and the systematic 
errors = (77 lt ... ,r ] D ) due to the inadequacy of the model space X. If x is the correct model, then 

y(°) — — F (x) + <Sy +77 . (1.1a) 

The hope is to use the data y (0) to estimate properties of the real earth. The complete infor- 
mation content of an arbitrary infinite-dimensional model vector x can be neither registered in a 
finite computer nor comprehended by a finite observer, so in any real inverse problem the data 
y<°) will be used to estimate a finite number of properties of the earth. In the present paper it will 
be assumed that these properties can be described by a finite list of real numbers which together 
constitute a "prediction vector," z = {z x ,...,z P ), in the /’-dimensional "prediction space" Z. The 
theory of the forward problem provides a function G which assigns to each model x in X the 
value G (x) that z would take if x were the correct model and there were no errors. Thus 
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z = G(x) + C (1.1b) 

where £ is the systematic error produced by inadequacies in the model space. Since (1.1b) is a 
prediction rather than an observation, it contains no random error <5z of measurement. That ran- 
dom error would enter only in a later attempt to verify the prediction by direct measurement, and 
it need not encumber the inverse problem. 

The uniqueness part of the inverse problem consists in using y (0) in (1.1a) to put limits on x, 
and transferring those limits to z via (1.1b). Of course this program is doomed unless estimates 
are available for the errors Sy, rj and £. By definition, a systematic error like tj or £ is one about 
which only inequalities are known, while the random error Sy can be thought of as drawn at ran- 
dom from a population of error vectors in Y with a probability distribution p E on Y . 

Even if p E for Sy and inequalities fort] and £ are known, the uniqueness problem is usually 
insoluble for another reason, basically because dim Y < dimX, so (1.1) has more unknowns than 
equations. The difficulty is obvious in the linear case. Let F, and Gj be the real-valued functions 
on X defined by 

F (x) = (F i(x), ..., Fp (x)) (1.2a) 

and 

G(x) = (G,(x) G P {x)). (1.2b) 

If F and G are linear, so are the data functionals F, and the prediction functionals Gj . If each Gj 
is a linear combination of F h ...,F D , then estimates of z are possible (Backus & Gilbert, 1968, 
1970), but otherwise the set of z’s permitted by the data is unbounded (Backus, 1970a). 

When the prediction functionals are not linear combinations of the data functionals, the 
observer who wants to estimate (i.e., bound) z from (1.1b) must invoke more information about 
the earth than is contained in (1.1a). This "prior information" comes from previous knowledge 
about the earth in particular and about physics and chemistry in general. For example, in 
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attempting to invert seismic data to find the density p and seismic velocities V P and V^, the 
seismologist is quite sure thatp, V P and V s are positive everywhere, and is rather confident of 
upper bounds for them based on laboratory measurements, the theory' of condensed matter, and 
solar and astronomical observations which contribute to the discussion of the likely range of 
chemical compositions of the earth. Examples of prior information in modelling the geomagnetic 
field are that its total energy cannot have a rest mass greater than that of the earth (the energy 
bound), and that the minimum ohmic heating rate required to sustain it is probably less than the 
total rate of geothermal heat flow observed at the earth’s surface (the heat flow bound). 

The seismic prior information just mentioned can be stated in terms of "linear bounds." 
There are M linear functions//, i = l,...,Af , which assign real numbers /,(x) to all earth models 
x, and there are 2N real numbers a, < b t such that the correct x is known to satisfy 

a,- <fi(x)<bi (1.3) 

for i = 1 M. In the seismic example, the model x is the triple of functions p, V P , V$, and 

/, (x) is the value of one ofp, V P or V s at a particular location r ; in the earth. Ev idently M =°° 
in the seismic case. This is essential, because for M <°° the f simply augment the 

number of data functionals from D to D-fM, and if dim X = <*= it is still true that 
D + M < dim AT. When M =«>, linear constraints are extremely useful in resolving nonunique- 
ness. Seismic examples are given by Wiggins et al. (1973), Garmany (1979), Orcutt (1980), 
Stark et al. (1986), Stark (1987, 1988), and Stark and Parker (1987). Examples from gravity 
inversion appear in Parker (1974, 1975), and examples from geomagnetic prospecting are given 
by Oldenburg (1983) and Huestis and Parker (1977). McNutt and Royden (1988) give an appli- 
cation to the geothenn. In all these cases, the numerical inversion technique is linear program- 
ming (Gass, 1985) but many of the problems are nonlinear and involve very' adroit use of the con- 
straints. 
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The energy and heat flow bounds in the geomagnetic modelling problem have a mathemati- 
cal character different from (1.3). They are both of the form 

Q(x,x)<q (1.4a) 

where q is a known real number and Q is a known positive-definite quadratic form in the model 
x. That is, Q (x, x) is a homogeneous quadratic polynomial in x, and 

Q (x, x) > 0 (1.4b) 

for every nonzero model x. The bound (1.4) is a prior quadratic bound on the correct earth model 
x. A single prior quadratic bound on x always confines the prediction vector z to a bounded sub- 

i 

set K z of the prediction space Z, whereas infinitely many linear bounds are needed to do so if 
dim AT =«. Whether K z is small enough to be useful must usually be settled by numerical calcu- 
lation. One of the conclusions reached in this paper is that very lax prior quadratic bounds on x 
can produce surprisingly tight bounds on z if the data are adequate. The use of a prior quadratic 
bound was suggested by Backus (1970a), Jackson (1979) and Gubbins and Bloxham (1985). 

Jackson (1979) calls (1.3) and (1.4) "hard bounds" on the correct earth model x, as opposed 
to "soft bounds." A soft bound is a probability distribution p x on the model space X, which 
describes where in X the correct model x is likely to be. A prior soft bound p x can be subjective 
or objective. If subjective, it is the observer’s personal probability distribution for x, incorporat- 
ing all his or her knowledge about x except the data y®\ Measuring y^ increases the observer’s 
knowledge, and changes p x from a prior to a posterior personal probability distribution. An 
objective p x arises when there really are many different earth models, for example many different 
ocean-bottom magnetic anomaly patterns, all sampled by surface magnetometers, and some sam- 
pled by deep-towed magnetometers. The deep-towed samples constitute an objective data base 
from which p x can be estimated by standard statistical techniques. This p x can then be used to 
interpret a surface record from a new area. 

i 

I 


September 20, 1988 



George E. Backus 


Confidence Set Inference 7 


The two currently popular methods for incorporating prior information into the inverse 
problem (1.1) are stochastic inversion (SI) and Bayesian inference (BI). Both methods use soft 
prior information, a probability distribution p x for models x in X. Stochastic inversion is essen- 
tially minimum variance estimation, while Bayesian inference uses Bayes’ theorem to combine 
p x with y (0) and the error distribution p E to produce a new probability distribution p x for x in X. 
Both SI and BI can be applied to either a subjective or an objective prior p x . When p x and p E 
are Gaussian and (1.1) is linear, BI and SI produce the same results. A review of the two methods 
and a bibliography appears in Backus (1988a). Tarantola’s (1987) book about BI appeared after 
that bibliography was assembled. 

In the geophysical literature it has become common practice to apply BI and SI to hard prior 
information like (1.3) and (1.4). This requires that first the hard information be "softened" to a 
prior probability distribution p x . Backus (1970c, 1988a), Jackson (1979) and Gubbins and Blox- 
ham (1985) discuss the softening process. Backus (1988b) recently shewed that when 
dim A' = softening the hard quadratic bound (1.4) to a probability distribution inevitably adds 
spurious information about the correct model x. This new information is subtle, and can easily 
escape the observer’s attention. It is definitely not implied by the original inequality (1.4), and is 
of a character likely to be unacceptable to most observers. For example, every p x which adds no 
other information to (1.4) will assign prior probability zero to the set of models x for which 
Q (x, x) is finite. When dim X =«> and the prior information is a quadratic bound (1.4), neither SI 
nor BI is a suitable technique for resolving nonuniqueness in the inverse problem (1.1). 

The present paper develops Neyman’s (1937) theory of confidence sets as a replacement for 
BI and SI when the prior information is a hard quadratic bound (1.4). The general idea of 
confidence set inference (CSI) is described by Backus (1987) and Stark (1988), and in fact is a 
special case of a well-established scheme for statistical inference, as will be shown in sections 3 
and 4 below. With CSI, the inevitable and often very large uncertainty in the value of q in (1.4a) 
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cannot be dealt with by "softening" (1.4a) to a probability distribution on X. Therefore, conclu- 
sions drawn with CSI must be tested for sensitivity to q over a range of values. In the geomag- 
netic problem, it will be seen that altering q by several factors of 10 docs not appreciably affect 
the conclusions. 

The present paper advocates replacing BI and SI by CSI only in certain circumstances. If 
the prior information really is a probability distribution p x for x on X, and the observer has 
confidence in it, then BI and SI are preferable to CSI. To use CSI on p x , the observer must con- 
vert p x to a quadratic inequality (hardening the soft information, in the terminology of Jackson, 
1979). When dimX =°°, this hardening process discards large amounts of prior information 
(Backus, 1988b). 

In the existence problem as well, SI and BI are unobjectionable techniques. When 
dimX » dim Y, computational schemes for constructing an x (0) which acceptably fits the data 
are prone to numerical instability because so many models are acceptable. Tikhonov (1963) and 
Tikhonov and Arsenin (1972) suggested a general remedy now called regularization: add another 
condition on x besides (1.1) which will make x (0) unique, so the computer does not go into hunt- 
ing oscillation or wander off to infinity in the high wave-number part of the model space. The 
oldest regularization scheme is simple truncation of the model space to an AT -dimensional sub- 
space X N , from which an x (0) is selected by a least-squares fit to the data. If the fit is unsatisfac- 
tory, N is increased. Tikhonov considered other regularization schemes, most of which seek the 
x that acceptably fits the data and minimizes some norm on X . Geophysical examples occur in 
Backus and Gilbert (1967) and Shure et al. (1982). Both BI and SI offer such techniques, and 
when so used they solve the existence part of the inverse problem but not the uniqueness part 
This distinction is not always made clear in the literature. 

One of the most convincing examples of prior quadratic information in the form (1.4) arises 
in trying to estimate the geomagnetic field B at the core-mantle boundary (CMB) from 
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measurements of B made on and above the surface of the earth. Therefore, the geomagnetic 
problem will be used to illustrate the general theory of CSI. The next section describes the 
geomagnetic problem in detail, and uses it to clarify some general issues in inversion. The 
remainder of the paper is a description of CSI, followed by its detailed application to the geomag- 
netic problem. 


2 THE GEOMAGNETIC INVERSE PROBLEM 

In the geomagnetic inverse problem used here to illustrate CSI, the model space X is the linear 
space of all possible geomagnetic fields B produced outside the core-mantle boundary (CMB) by 
electric currents inside the core. Every such B is defined only in the region from the CMB to 
infinity, and can be written there as 

B = -Vyr (2.1a) 

where 

OO / 

V'(r) = a £ ( a/r )‘ + ' E £/"(«) Y " ^ • ( 2 - lb ) 

/ = 1 /n=— / 

Here a is the radius of the CMB, r is the position vector measured from the center of the earth, r 

is | r | , f is r/r , and { Yf 4 y/) is any orthogonal basis for the spherical harmonics of degree / , 

normalized so that |17 "| 2 averages to (2/+1) -1 on the surface of the unit sphere. Th e/J/”(a) are 
the internal Gauss coefficients of B at the CMB. They can be used to parameterize the model 
space X . Alternatively, X can be parameterized by any parameterization of the set of scalar- 
valued functions on the CMB, because B outside the core is completely determined by the radial 
com pn .rent B r on the CMB (Backus, 1986, gives several parametrizations). In this formulation 
of the inverse problem, the true values of B r on and above the CMB will include unmodelled 
contributions from sources outside the core. These are treated as errors in the data . 
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In this geomagnetic inverse problem, the data are 

y,- = v, • B(r,) (2.2) 

where, for / = 1, ...,D , v, is a unit vector and r, is a position on or above the surface of the earth. 
Equations (2.1) and (2.2) describe how to calculate the data produced by any model B if the 
errors vanish. Therefore, (2.1) and (2.2) give the forward data function F in (1.1a). This F is 
linear, but it would be nonlinear if some of the data were intensities or angles rather than Carte- 
sian components of B. 

For any i among 1, ...,D , if v ( - is the unit vector appearing in (2.2) then the y/ 0) of (1.1a) is 
the measured v ; component of the magnetic field at r ; . Thus y/ 0) includes the instrument errors, 
position errors, and the field produced by all sources outside the core (since (2.1) models only 
internal sources). These are magnetization in the crust and the satellite, and electric currents in 
the mantle, crust, ocean, ionosphere, satellite, and magnetosphere. The observer has wide lati- 
tude to apportion these contributions to y^ among the three terms in (1.1a): F(x), Sy and tj. For 
example, the instrument and navigation errors are likely to be treated as part of the random error 
Sy. The contribution to v (0 ' from crustal magnetization might be unknown, but an upper bound 
on its magnitude might be available; then it would belong in the systematic error tj . Alterna- 
tively, perhaps the crustal magnetization is well-described by a two-dimensional random process 
on the surface of the earth. Then its contribution to could be included in Sy. This statistical 
hypothesis about p E , the probability distribution for Sy, could be tested by the methods described 
in section 3. Finally, if a representation for B more general than (2.1) were adopted (Backus, 
1986), any one of the sources of B could be included as part of the earth model x, and its contri- 
bution to B computed as part of F(x). 

The predictions 2 ,- in (1.1b) might be the 224 Gauss coefficients of degree less than 15, or 
the values of B r at 300 points on or above the CMB, or the flux of B through certain null-flux 
curves on the CMB (Backus, 1968; Gubbins and Bloxham, 1985). In the first two cases, G in 
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(1.1b) is linear, and in the first case f =0 in (1.1b) because the prediction can be made exactly 
once the model is known. In the last two cases, £ *0 because the prediction includes a contribu- 
tion from magnetic fields with sources outside the core. 

In the geomagnetic example, prior inequalities (1.4) can be obtained for both energy and 
dissipation rate. Einstein’s relativity requires that the rest mass of the energy in B be part of the 
total mass of the earth, as measured by surface gravity and the Cavendish experiment for 
Newton’s gravitational constant. In the notation of (2.1), it follows that (Backus, 1988a) 

o© / 

X (2/+1X/+1)- 1 X 1)3 /"(a ) | 2 < 2 x 1 0 33 nT 2 (2.3) 

/=! m— / 

if the p "(a) are measured in nanoTesla. Alternatively, the electrical conductivity of the core 
probably is known to within a factor of 10, and the total geothermal heat flux is known with 
somewhat better accuracy (Stacey, 1977). If ohmic dissipation in the core does not exceed the 
geothermal heat flux then (Backus, 1988a, based on Parker, 1972, and Gubbins, 1975) 

X (/+l)(2/+l)(2/+3)/" 1 X l/3/ m ( fl )l 2 -3x l0 17 nT 2 . (2.4) 

1=1 

Since the gauss coefficients are probably of the order of 10 5 nT at the CMB, bounds like (2.3) and 
(2.4) appear at first sight to be unhelpful. It is one of the counterintuitive properties of BI on 
model spaces of high dimension that geographically well-distributed measurements of B can use 
a probabilistic (softened) version of (2.4) to find p i°(a ) from satellite data correct within about 
one part in 10 5 , and to find <?io( a ) within five percent (R.A. Langel and R.H. Estes, private com- 
munication). Whether this accuracy survives the replacement of BI by CSI will be answered by 
future computations, to be reported elsewhere. 

Other quadratic forms besides (2.3) and (2.4) have been considered in geomagnetic model- 
ling (see, e.g., Shure et al., 1982), but all have been isotropic, i.e., invariant under rotations about 
the center of the earth, and so have had expressions of the form 
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£c(Z) £ l/Wl 2 ^<7 (2-5) 

/ = 1 m=-/ 

for various choices of C (/). 

3 CONFIDENCE SET INFERENCE IN GENERAL 

Confidence set inference (CSI) is an extension of the parametric method of statistical inference 
originally proposed by Neyman (1937). As a formal procedure, it applies to all inverse problems, 
linear or not. Useful results are extracted in particular problems by exploiting their special 
characteristics, such as linearity or the availability of prior information. The following discussion 
of CSI draws heavily on Chapter 34 of Cramer (1946). The necessary measure- and set-theoretic 
notations are listed in Appendix A. 

In CSI, the observer has measured a data vector y® = (y in the D -dimensional 

data space Y =R° . ( R is the real line and R° is the space of the real 1 xD matrices.) The 
observer knows that 

y(0) = y(0 ) + T ; (3-la) 

where y^ and tj are two vectors about which only the following partial information is available: 
there is a given (usually small) subset S Y of Y such that 

r jeS Y \ (3.1b) 

and 5 >(0) was drawn at random from a population whose probability distribution on Y i spy. This 
py is not known, but it is known to belong to a given m -parameter family of probability distribu- 
tions on Y . The parameters, 6 h ■■■,&„ , will be collected into a parameter vector 

e:=(e u ...,e A ) (3.ic) 

in the m -dimensional space Q=R™ of parameter vectors. (":=" means "is defined as.’’) The pro- 
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bability distribution whose parameter vector is 6 will be written p(0, •). All the p(6, •), for all 
0 6 0, are assumed to have the same domain Zy, a particular cr-ring of subsets of Y. If Q e Zy, 
then the probability assigned to Q by p (0, •) will be written p(6,Q). There is no restriction on 
how the p ( 0 , •) depend on 0. They could all be Gaussian, with means and variance matrices in Y 
specified as functions of0; or they could be spline fits to an empirical distribution obtained from 
the data (Constable and Parker, 1988). Neyman (1937) requires that the function Q ■) be 

injective, but no such demand need be made here; p(d j, )=p(6 2 , •) will not imply 0j=0 2 - In the 
formal development of CSI, m in (3.1c) is unrestricted and can even be infinite. In practical 
applications, usually 

m « dim Y , (3. Id) 

a restriction which permits use of the observed data vector y <0) to test the hypothesis that py 
really does belong to the family p (6, •) (Kendall and Stuart, 1979, Ch. 30). 

The observer’s goal is to use y (0) to predict P numbers 2 j, .... z P . Let z := (2 j, .... z P ) be the 
"prediction vector," a member of the prediction space Z=R r . Nothing is known about z except 
that 

z = G(S 0 ) + f (3.le) 

where G : ©— »Z is a known function, 6 0 is unknown but is known to be one of the (possibly 
many) parameter vectors 6 such that 

Py=P(0,-), (3. If) 

and C is an unknown vector about which no information is available except that 

C € S z (3.1g) 

where S z is a given subset of Z. The vectors^ in (3.1a) and £ in (3. 1 e) can be thought of as unk- 
nown systematic errors. 
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The CSI solution to the problem (3.1) is to look for a subset K z cZ, which is small in 
some useful sense (to be discussed later) and which has a high probability of containing the true 
prediction vector z. This idea of probability requires careful examination because it is not quite 
the usual one axiomatized by Kolmogorov (1950). Unlike Bayesian inference, CSI assigns pro- 
babilities to statements rather than to subsets of Z, and the number assigned to a statement is not 
unique. Typically, CSI produces a statement X, a number p g (0, 1], and a proof that either X is 
true or some event E has occurred whose probability wasp. If p « 1, the observer usually will 
feel safe in accepting X as true. The value of p is the "failure rate” for X, and 1 -p is the 
"confidence level" for X. The statement X is said to hold with failure ratep or at confidence level 
1 -p . An observer who always accepts statements when they have failure ratep will be wrong, in 
the long run, in a fraction of cases approximately no more thanp. The nonuniqueness arises 
because if X is true with failure rate p , then obviously X is true with every larger failure rate. 
Ideally, one would like to calculate the greatest lower bound of the failure rates for X, but this is 
often a very difficult calculation. 

The calculus of failure rates is somewhat different from the ordinary calculus of probability. 
To illustrate this, suppose that Xj and X 2 are statements with failure ratesp x and p 2 . Let £,• be the 
event with probability p t which must have occurred if X,- is false. Letp 1 a p 2 denote the smaller 
of p 1 and p 2 . If Zj implies X 2 , then failure of X 2 means that Xj is false, so E x has occurred. 
Therefore X 2 has failure rate p j. Consequently, if Xj implies X 2 , then Z 2 has failure rate p 1 Ap 2 
as well as rate p 2 . The statement "Xj or X 2 ” is false whenever both Xj and X 2 are false. (Here 
"or” has its technical, logical meaning.) Then both E } and E 2 must have occurred. The probabil- 
ity of this compound event could be as high aspj Ap 2 , but not higher, so "Xj or X 2 ” holds with 
failure rate <pj Ap 2 . The statement "Xj and X 2 " is false whenever one of Xj or X 2 fails, i.e., 
whenever one of £1 or £ 2 occurs. The probability of this compound event is no greater than 
p ! +p 2 , so ”X, and X 2 " holds with failure rate <p 1 +p 2 . If X! is certainly true, then p j = 0, and as 
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special cases of the above calculations, "X, or Xj” holds with failure rate 0 (in fact it is certain), 
"Si and 1^' holds with failure ratep 2 , and if Xj implies X 2 then X 2 holds with failure rate 0. In 
fact, if X! implies X 2 and X! is certain, then obviously so is Xo. 

The goal of CSI is to construct for each p e (0, 1 ] a set K z (p) c Z , and to prove that if z is 
the true prediction vector then 

zeK z (p) (3.2) 

with failure rate <p. It is desirable that the set K z (p), called a confidence set for z, be as small 
as possible, in a sense to be discussed later. The proof that (3.2) holds with failure rate <p will 
make heavy use of the rules of inference described in the preceding paragraph. 

Given p e (0, 1], the observer begins the construction of K z (p) by choosing for each 0 e © 
a set/£ y (p,0)e L r such that 

p(6,K Y (p,0)) > 1 — p . (3.3a) 

For each 6 e ©, the set K r (p ,0) can be chosen in any way whatever, subject only to (3.3a). 

Now the 0 O which appears in (3.1e) solves (3. If), so p (0 O , ) is the tree probability distribu- 
tion p Y . Therefore the probability that y ^ e K } (p ,0 0 ) is at least 1 — p , so the statement 

y < O) G * r (p,0 O ) (3-3b) 

holds with failure rate <p. By hypothesis, 

77 e 5 r . (3.3c) 

Since (3.3b) has failure rate <p and (3.3c) has failure rate 0, the conjunctive statement "(3.3b) 
and (3.3c) are both tree” has failure rate <p +0=p. But the truth of this conjunctive statement, 
together with (3.1a), implies 

y< o > e /r<p,0o) + S 1 '. (3.3d) 

Therefore (3.3d) has failure rate <p. The offending event E with probability <p which must 
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occur if (3.3d) fails is the event that (3.3b) is false; i.e., 

y<°>e T \A" r (p,0 o )- (3.3e) 

Next, for each ye f the observer defines a set, namely 

A~V,y):= [6 :0e 0 and y e K Y (p,6) + } . (3.30 

Clearly A^(p, y)c 0. Furthermore, 6 e K^(p, y) iff (if and only if) ye K Y (p,0)+S Y . There- 
fore (3.3d) is true iff 

6 0 e A‘ 6 (p,y (0) ), (3.3g) 

so (3.3g) has failure rate <p . Finally, the observer defines the set 

K z (p) := G (K% , y<°>)) + . (3.3h) 

Statements (3.1e) and (3.1g) are certain and (3.3g) has failure rate <p. If K z (p) is defined by 
(3.3h), then (3.2) is a consequence of the simultaneous validity of (3.1e), (3.1g) and (3.3g). 
Therefore, by the failure rate rules of inference, when K z (p) is defined by (3.3h) then (3.2) holds 
with failure rate <p. The offending event with probability <p which must occur if (3.2) fails is 
(3.3e). 

The observer has great freedom in choosing K y (p, 6), the only restriction being (3.3a). 
This freedom can be exploited to make the set K z (p) in (3.3h) as small as possible. But what 
does "small" mean for a subset of Z if dim Z > 1? No length or volume has been defined on Z, 
and "small" may mean different things for different components z,- , since those components may 
be measured in different physical units. If dim Z = 1, the "size” K z of a subset K z c Z is easy to 
define because Z is simply the real line R . A convenient definition is 

| K z | := *4 (sup K z - inf K z ) (3.4a) 

where sup K z is the supremum or least upper bound of K z and inf K z is the infimum or greatest 
lower bound of K z . The quantity 2 1 K z \ is called the "diameter" of K z because 
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2\K Z | =sup { Im-wI : u,w e K z ) . (3.4b) 

Therefore \K Z | might be called the "radius" of K z . Definition (3.4b) could be extended to the 
case dim 2 >1 if | u | were defined in that case, but it is not. A useful consequence of (3.4a) is 
that if K z and K z are any subsets of R then 

\K\ +K% | = |ATf | + \K% | , (3.4c) 

with equality, not merely inequality. 

Bayesian inference and stochastic inversion lead naturally to prediction spaces Z with 
dimZ>l, but in confidence set inference it is both possible and desirable to consider only 
dimZ = 1. The goal of any inversion of data is numerical prediction. The point of considering 
the predictions z x z P together as a single prediction vector z=(Z],...,z/>) is that for any func- 
tion h :Z-*R a confidence set for the numerical prediction h(z) is obtained easily from the 
confidence set for z. If z e K z (p ) with failure rate p , then clearly h (z) e h (K z (p )) with failure 
rate p. But trying to treat all functions h:Z^R simultaneously has a cost. It forces a 
compromise in choosing the K r (p,6 ) for (3.3a): none of the confidence set radii \h(K z (p))\ 
should be immoderately large, whatever the function h :Z ->R. Clearly, for any particular h , 
the smallest value of \h(K z (p))\ is achieved by ignoring other functions h' when choosing 
K y ( p,0) to satisfy (3.3a). For any particular prediction h (z), K r (p,6 ) should be tailored to the 
particular function h :Z —>R. But then the observer is simply replacing the function G : 0 — »Z 
in (3.1e) by the function g :©-»/?, where g -h o G , the composite of h with G . Thus in CSI 
the observer is really dealing with the case dimZ = 1 in (3.1). This point of view has been made 
practical only by the advent of inexpensive and powerful computers. Before them, economy 
required that the data be "reduced" to a multidimensional model and its error estimates. All indi- 
vidual numerical predictions were then computed from the model. 
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There may remain cases where an observer really needs a multidimensional prediction for 
itself rather than as an intermediate step toward various one-dimensional predictions. The present 
paper does not discuss these cases. Henceforth, it will be assumed that dim Z = 1, so Z -R , and 
the function G : © — » R in (3.1c) will be written g :Q-^R, while z will be written z and £ will 
be written £. Since the codomain of g is R , g is a functional, the prediction functional. 
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4 CONFIDENCE SET INFERENCE IN GEOPHYSICAL INVERSE PROBLEMS 

Most geophysical inverse problems, even nonlinear ones, permit considerable simplification of 
the general formalism for CSI given in section 3. To see this, it is necessary to state the typical 
geophysical inverse problem at an almost pedantic level of precision, as follows. 

The observer has measured the components of the data vector y (0) =(>’i (0) ,...,y^ 0) ) in the 
data space Y =R ° , and wants to use y (0 * to predict the value 2 of a certain numerical property of 
the earth. As discussed at the end of section 3, the prediction is only one real number rather than 
several. The tools available to the observer are these: (1) an arbitrary set X called the model 
space, whose members x will be called earth models: (2) a function F :X ->Y called the data 
function; (3) a functional g :X ->R called the prediction functional; (4) a known subset S Y of Y ; 
(5) a known subset of S z of R ; (6) an m -parameter family of probability distributions on Y . The 
case m =0 is permitted, and then the "family" will consist of a single known probability distribu- 
tion p on Y. If m >0, the m parameters will be collected into a single parameter vector 
0 : =((?],... ) in the parameter space ©=R m , and the probability distribution with parameter 
vector 6 will be written p (6, •). The probability which p (6, •) assigns to the subset Q c Y will be 
written p(6,Q). All the p (6 , •) must have the same domain Z Y , a cr-ring of subsets of Y. More- 
over. Z y must be closed under various operations of Euclidean geometry. Rather than list these 
in detail, the present paper assumes simply that Z Y consists of all subsets of Y to which a 
Euclidean volume can be assigned, i.e., the Lebesgue-measurable sets (Halmos, 1950), so Z Y will 
include all open and all closed subsets of Y. 

The information available to the observer is the following: a vector Sy was drawn at random 
from a population in Y whose probability distribution is p E . This p E may be unknown but is 
known to belong to the given m -parameter family. That is, there is at least one 6 e © such that 

Pe=P@’). (4.1a) 

The observer also knows that there are vectors xe X , 7] e Y and a real number £ such that 

7] e S Y , (4.1b) 
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£eS z , (4.1c) 

y<°> = F(x)+ n+8y, (4. Id) 

and 

z=g(x)+£. ( 4 - le ) 

In (4.1), x will be called a correct earth model, tj will be called the systematic error in modelling 
the data with x, £ will be called the systematic error in modelling the prediction w'ith x, and Sy 
will be called the random error of observation. 

The geophysical inverse problem (4.1) is a special case of the statistical inference problem 
(3.1). To see this, introduce the following definitions: 


0:=0xX (4.2a) 

§ := (d, x) (4.2b) 

y (0) :=F(x) + Sy (4.2c) 

g(6):=g(x) (4.2d) 

and, for every Qely and every' (0,x) e 0, 

p(P,Q):=p(e,Q-F(x)). (4.2e) 


To use this correspondence between (4.1) and (3.1), the "parameter vector" 6 must be permitted 
to be a member of an arbitrary set 0, not necessarily an m -tuple like (3.1c). The assumption 
(3.1c) was never used in section 3, so abandoning it does no harm, and now (4.2b) makes sense. 
In practice, X is usually a linear space or a manifold of some dimension N , and then 6 in (4.2b) is 
exactly of the form (3.1c) with m =m+N . In this special case, the desirable but not essential 
condition (3. Id) becomes 

m + dim X « dim Y . (4.3) 

The formal application of CSI to (4.1) requires neither (3.1c) nor (4.3), but (4.3) does permit tests 
of the statistical hypothesis (4.1a), that Sy was drawn at random from a population described by 
some member of the family p (6, •)• 
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In principle, the family p(6, ) can also depend on x, but the simpler problem (4.1) covers 
most geophysical inversions. In that simpler problem, some of the flexibility available in choos- 
ing a K r (p,6 ) is not very’ useful, and can be sacrificed to the goal of economy. To construct the 
special kind of K r (p, 6) appropriate to (4.1), for each 6 e 0 choose a set K } (p,0) e Y. y such 


that 

p(0,K Y (p,0))Zl-p (4.4a) 

and define (in the notation of A.3d) 

K Y (p,e):=F(x) + K Y (p,0) (4.4b) 

where 6 = (0,x) as in (4.2b). Then (3.3f) becomes 

K^(fi,y)‘= {(6,x):6 e Q and xeX and ye F(x) + K Y (p,6) + S Y ] . (4.4c) 

Now, since dim Z = 1, (3.3h) is written 

K z (j>) = g(K*(p,y (0 )) + S z (4 Ad) 

and, at confidence level £ 1 -p or with failure rate £p 

zeK z {p). (4.4e) 


The description of K z (p) can be simplified further. For any subset W c @ define n x (W ; ) 
as 

n x (M~) := {x : x e X and there is at least one 6 e © such that (6 , x) e W } . (4.5a) 

Heuristically, n x (W) can be thought of as the shadow of W cast on X by light rays parallel to ©. 
According to (4.2d), for every' subset W c 

g(tf) = g(ri*(W)). (4.5b) 

Define K x (p , y) c X as 

K x (p,y):=U x (K%,y)). (4.5c) 

Then the various definitions imply that 
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K x (p,\) = [x : x € X and there is at least one 6 e 0 such that y e F(x) + A' r (p,6) + S r } . 

(4.5d) 

Furthermore, from (4.5b) and (4.5c), 

s (K % . y (oy )) = g(K x (p> y (0) )) ( 4 - 5e ) 

so finally 

K z (p) = g (K x (p , y (0) )) + 5 Z <4.50 

and, with failure rate <p , 

z e K z (p ) . (4.5g) 

The observer still has great freedom in constructing the sets K Y (p,6) to satisfy (4.4a), and 
this freedom can be used to minimize | g (K x (p , y (0 ^)) | . By (4.5f) and (3.4c), the result will be to 
minimize \K z (p)\. To achieve the minimum possible value of | g (K x (p , y (0) )) | can require 
intricate statistical theorj’; the present paper will try to make | g (K x (p , y (0) )) | small, but not 
always as small as possible. 

5 LINEAR INVERSE PROBLEMS WITH KNOWN ERROR STATISTICS AND AN 
INJECTIVE DATA FUNCTION 

The solution (4.5) to the inverse problem (4.1) is that z e K z (p) at confidence level > 1 -p, or 
with failure rate <p, where K z (p) is given by (4.5f). This "solution" is incomplete, because no 
particular choice has been made for the sets K Y (p,6 ) in (4.4a). The present section shows one 
way to choose the sets K y (p,6 ) and thus complete (4.5f) to an explicit solution, in what is 
perhaps the simplest inverse problem of all, the linear problem with known random error statis- 
tics and an injective data function. 

In this simplest version of problem (4.1), X is a linear space, F : X — » Y is a linear injection, 
g :X ->R is a linear functional, and the observer knows p E , the probability distribution in Y for 
the random error vector 5\ which appears in (4. Id). 
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As noted in appendix B, p E induces a unique dot product on Y , and this makes Y a finite- 
dimensional Hilbert space. Therefore p E can be discussed in the language of tensors on Hilbert 
spaces, as set out in appendix C. The expected value of Sy, E [<5y], is known. The observer can 
redefine y (0 ^ as y (0 *-£ [5y], thus achieving the result 

E[6y] = 0. (5.1a) 

Then the variance tensor of p E is simply E [(5y)(<5y)], and in terms of the dot product on Y deter- 
mined by p E , 

E[(8y)(5y)]=I r (5.1b) 

where Iy is the identity tensor on Y . 

Let PpQ q and Q E q;) be the orthogonal projectors of Y onto F(X) and its orthogonal com- 


plement, ^(X) 1 . Then 

P fq:) (F(x)) = F(x) (5.2a) 

and 

Q F( X)(.E (.*)) = 0 (5.2b) 

for every x € X . Applying Pfqc) 30(3 Qf(X) 10 (4. Id) gives 

^V(X)(y ( °^) — F(x) + P E qc )(77 ) + Ff(X )(<5y) (5 .2c) 

Qf qc )(y<°>) = Qf(X)(T1) + Qf(x )(<5y) • (5 .2d) 


Now F : X Y is an injection, and thus so is (F | X ) : X F (X ). But F \ X is also a surjec- 
tion, and hence is a bijection. Therefore, (F|X) _1 :F(X)->X exists. For brevity, this inverse 
function will be written simply F -1 , so 

F~ 1 :F(X)-+X. (5.3a) 

Note that F -1 ( y) is defined only for ye F(X), not for all ye Y. Therefore, F -1 could not be 
applied to (4. Id). But (4. Id) has been split into the two pieces (5.2c) and (5 .2d), and F _1 can be 
applied to (5.2c) to give 
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X = x (°) - F~ x o P F(X )(ri) - F- 1 o P F0C) (Sy) (5.3b) 

where 

x (0) = F" , o P F(X )(y {0y ) (5.3c) 

and F~ l o P FQL) is the composite (A.la) of F~ x with P F (X) . Substituting (5.3b) in (4.1e) gives 
z =z (0) + C-g of-'o P F Qcin)-S O F~ ] o PpQ;)(Sy) (5.3d) 

where 

z (0) = g (X<°>) = g o F~ l o P FQ( )(y {0) ) ■ (5.3e) 


In this simplest inverse problem, (5.3d) contains the genm of the solution (4.5e,f). Since p E 
is known, there is no parameter vector© in (4.1a), and m, the number of parameters, is zero. 
Then in (4.4a), K r (p,0 ) does not depend on any 6, so (4.4a) becomes the requirement that a set 
K y {p) c Y be chosen so that 

p E <JC Y (j>))>\-p . (5.4) 

To motivate the particular choice of K r (p ) to be suggested here, consider the linear functional 
g o F~ x o P F (x) ‘Y Since Y is a finite-dimensional Hilbert space, so is its subspace F (X). 
Therefore there is a unique vector ye F(X), such that for all ye F(X), go F~ l (y)=yay. But 
then, for all ye Y, 

g o F~ x o P F( x)(y)=ray (5 .5a) 

because g o F~ x o P F (x)(y) =7 □ P F (X)(y) - p fqc)(Y) □ >’ =7 □ It is also true that 

II; o F -1 || = ||rll (5.5b) 

where ||y|| := (y ay)* 4 and ||g o Z 7-1 !! is defined as in appendix C. Hence, iff := y / |jy!| , then 

y=||goF~ 1 ||y (5.5c) 

and 

llfll = 1 - (5.5d) 


September 19, 1988 



George E . Backus 


Confidence Set Inference 25 


Then 

go F~ l o P F (X)( Sy) = Il£ o F _1 ||(y □ Sy) . (5.5e) 

Now consider the real random variable y □ Sy. Its probability distribution, p -, is the margi- 
nal distribution of p F on the one-dimensional subspace of Y spanned by y. If V is any 
Lebesgue-mcasurable subset of the real line then 

Py(y)=p E (yr+{r) ± ). (5.6a) 

The mean and variance ofyci5y are 

E\yoSy] = 0 (5.6b) 

and 

F[(ya5y) 2 ] = l , (5.6c) 

because F[yo5y) = yci£[5y] = yoO = 0, and £[(yo5y) 2 ] = £[(y o5y)(5yoy)] = 
E [y D(5y)(5y) ay] =yoE [(8y)(8y)] cy = ycl r ny = yay= 1. 

To construct a K Y (p) in (4.4), use the notation of appendix A for open and half-open inter- 
vals. For any v g [0, «), define 

p(y,v)=p-([v,«))+p-((^o,-v]) . (5.7a) 

For simplicity, suppose that p~(V)= 0, whenever V contains only one point, and that p-(V)> 0 
whenever V is an open interval. Then 

p (f. 0) = 1 , (5.7b) 

and as v increases from 0 to °°,p(y,v) decreases monotonically from 1 to 0. Therefore the func- 
tion p(y, •) : [0, »o) — » (0, 1] has an inverse function, p(y, -) _1 : (0, 1] — » [0, °°), which will be writ- 
ten v(y, ■) : (0, 1] — » [0, «>). The probability is p that \yoSy | >v(y,p). This motivates the 
definition 

K r (p):= {y:ye Y and |y ny| <v(y,p)} . (5.7c) 

Then 5ye K Y (p) with probability p . Therefore 
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jfoSyl <v(f.p) (5.7d) 

at confidence level 1 -p, or with failure ratep. 

Now for any positive a e R , let 

B z (a) t= [z :z e R and \z\<a). (5.8a) 

Clearly 

\B z (a)\=a (5.8b) 

in the sense of (3.4a). Also, from (5.5d) and (5.7 d), 

Sor'o P fq() (5 y)e £ z (||g oF- 1 ||v(f,p)) (5.8c) 

with failure ratep. Therefore, from (5.3d), 

z e K z (p) (5.86) 

with failure ratep , where 

K z (p) = zW + S z +g o F~'o P F(X p Y ) + B z (\\goF- 1 Mr,p))- (5.8e) 

Often, S z and S r , and hence g oF~ 1 oP F( ^ ) (.S r ), will be symmetric about the origin; that is, 
S z ~-S z and S Y =-S Y . In this case, the notation (3.4a) provides a simple way to restate 
(5. 8d,e), namely 

1 2 -z (0) | < |5 Z | + lg oF-'oP P(X) (S r )\ +||goF- 1 ||v(f,p) (5.8f) 

with failure ratep. 

Of the three terms on the right in (5.8f), | S z | will be easy to evaluate because a symmetric 
S z is usually defined simply by giving |5 Z |. For the second term on the right of (5.8f) it is 
always true that 

|g oF~ l oP F (x)(S Y )\ < Hg oF -1 || |S r | . (5.9a) 

where 

|S y | : = sup {|| y H : y e S r } . 
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The inequality in (5.9a) is equality if S Y is a ball, i.e., if there is a real / 3 such that S Y =B Y (/?) 
where 

B y (P) := [rj :tj e Y and Unll ^p] ■ (5.9b) 

However, (5.9a) will not be the best estimate if, as often happens, S Y is a parallelipipcd. In this 
case there is a^ = (/?i P D )e Y such that S Y =C Y (P) where 

C Y (P) := tn :ij e Y and |t?/ | < \p t | for »' = 1,...,D). (5.9c) 

If S Y =C Y (P) but the Syi are correlated, then the edges of S Y are not mutually orthogonal in Y, 
and calculating | g o F~ l o Pfqc)(Py ) | becomes a problem in linear programming on a graph with 
2° vertices in N dimensions, where N = dim X . 

It remains to discuss the last term on the right in (5.8f). The easiest way to evaluate 
||g o F -1 || is to use the identity (5.5b). To evaluate v(y,p), one must do the integrals with respect 

to dp E in Y required to evaluate p* in (5.6a), and then one must cany out the integrals over the 

real intervals in (5.7a), to evaluate p(y,v). The inverse function ofp(f, •) is v(y, •)• In one spe- 
cial case, all these calculations can be done almost in closed form. If p E is Gaussian, then p - is a 
Gaussian distribution on R with mean 0 and variance 1. There is only one such one-dimensional 
Gaussian, so p- does not depend onf, and (5.7a) becomes 

oo 

p(v) = J d^e~'^ . (5.10) 

V 

The inverse function, v(p), is tabulated; a partial list of its values appears in Table 1. Rational 
approximations are given by Abram owitz and Stegun (1964, p. 298). 

+-H-+++++++Table 1 near here+++++++++++ 

The solution (5.8d,e) or (5.8f) has a simple interpretation in terms of least-squares fitting. If 
V is any subspace of Y, and ye Y, then the projection 7V( y) is that ve V which minimizes 
lly — v|| (Halmos, 1958). This is the ve V which provides the best fit to y in the sense of 
weighted least squares, the weight matrix being the inverse of the variance matrix of the random 
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errors (see appendix B). Thus the x (0) of (5.3c) is that member of X such that F (\ (0) ) gives the 
best fit to y (0) in the sense of weighted least squares. The z (0) of (5.3e) is the value the prediction 
would have if x (0) were actually the correct earth model and there were no errors. 

Of the two equations (5.2c, d) into which (4. Id) was partitioned, only (5.2c) has been used 
so far. If F(X)-Y, then <2 f(A') = 0 - and (5.2d) has no content. On the other hand, if 

dim X « dim Y (5.1 1) 

then, since m- 0, (4.3) holds, and it is possible to use the data vector y (0) to test the hypothesis 
that Sy was drawn at random from a population whose probability distribution is p E . In the sim- 
plest case, | Sy | « 1 so IK2 f(X)(* 7)II « 1 in (5.2d). Since any component of Sy has variance 1, it 
follows that Q F qc)( 7 1) can be neglected in (5.2d). Then that equation asserts that 2f(x)(y (0) ), the 
residual part of y® which remains after removing the best least squares fit, is itself a random 
sample from the population Qf(x)(Sy). The probability distribution of this population is the mar- 
ginal distribution of p E on F (X ) i . Many tests of this hypothesis are available in the extensive 
literature on tests of fit (see, e.g., Kendall and Stuart, 1979, Ch. 30). As an example, suppose that 

p E is Gaussian. Let N =dimX =dimF(X), and let {$ N+ \ Sd ) be an orthonormal basis for 

F (X ) x . Then the D -N numbers Si ny (0) with i -D +1 N should be independent random 

samples from a one-dimensional Gaussian population with mean 0 and variance 1. There are a 
variety of tests of this hypothesis (Kendall and Stuart, loc cit), one being the Kolmogorov- 
Smimoff test. Another test uses the fact that 

s 2 := (D -N)~ x £ (Si oy (0) ) 2 (5.12a) 

i= A r +1 

is the sum of D -N independent identically distributed random variables, so the probability dis- 
tribution for s 2 is nearly Gaussian (the exact distribution for (D -N)s 2 is chi-square with D -N 
degrees of freedom). The mean of s 2 is 1, and its variance is 2 (D -AO -1 , so the probability is 
nearly 1 -p that 

|s 2 -l | <v(p)[2/(D -N)]' A . (5.12b) 


September 19, 1988 



George E. Backus 


Confidence Set Inference 29 


That is, if Pe is Gaussian then (5.12b) holds with failure rate approximately p. In (5.12b), v( ) is 
the inverse of the functionp(-) defined by (5.10). 

6 USING A PRIOR QUADRATIC BOUND TO MAKE THE DATA FUNCTION 
INJECTIVE 

The preceding section is based on three crucial assumptions: that the observer knows the proba- 
bility distribution pe for the random error Sy in the data vector that F :X —>Y and 
g :X —>R are linear; and that the data function F :X —*Y is an injection. The present section 
shows how to relax the third assumption, so as to treat realistic linear inverse problems in which 
dim AT =«». Section 5 cannot treat such problems directly because if F is injective then 
dim X = dim F (. X ). Since F (X ) c Y , it follows that dim X < dim Y = D < when F is injective. 

When dim X =<» (or more generally, when F is not injective even though dim X <°°) the 
idea is to approximate the inverse problem (4.1) by another problem with an injective data func- 
tion. To generate this subproblem, it is necessary to invoke prior information beyond that con- 
tained in the original problem (4.1). In the present section, the new prior information will be a 
quadratic bound (1.4). The observer knows a constant q and a positive definite quadratic form Q 
on X ; and the observer knows that there are vectors x € X , tj e S Y and a scalar £ € S z such that 
x,77 ,£ satisfy (4.1) and x also satisfies (1.4). 

Appendix B shows that q~ x Q defines a dot product on X. The dot product of x and x will 
be written xox, and ||x||, the norm (or length) of the vector x, will be defined as ||x|( = (xox)* 4 . 
Then (1.4) can be written as 

II x || SI. (6.1) 

Even knowing (6.1), the observer cannot reduce (4.1) to a problem with an injective data 
function unless certain further information is available about the quadratic form Q which appears 
in the prior bound, (1.4) or (6.1). Specifically, the bound (1.4) must be a strong enough constraint 
on x that it makes the given data function F :X ->Y and the given prediction functional 
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g :X -+R continuous in the nonn ||x|| which q ] Q defines. That is, lim || x„ -x|| =0 must 

n — >oo 

always imply lim ||F(x n )-F(x)|| = 0 and lim |g(x,,)-£(x)| =0. Since F and g arc linear, 

n — ►«> n — 

continuity at x = 0 is equivalent to continuity at every' xe X (Halmos, 1951). And since F and g 
are linear, continuity at x = 0 is equivalent to boundedness (see appendix C for definitions and 
Halmos, 1951, for proofs). 

When dim A <°°, F and g will always be continuous because they are linear, but when 
dimA=~ linearity does not imply continuity (Halmos, 1951). Continuity of F and g with 
respect to the norm ||x|| is an extra assumption about that norm. When dim AT =«> and F is not 
continuous, CSI requires the observer to reexamine her or his prior knowledge of the earth and try 
to replace (1.4) with another such inequality strong enough to make F continuous. Backus 
(1970b) describes some methods ("quellings") for doing so. When dim A' = <» and g is not con- 
tinuous, the observer has two choices: finding a stronger Q for (1.4), or accepting a slightly dif- 
ferent prediction z A for which (4.1e) can be replaced by z A =g A (x)+Ca> with a new prediction 
functional g A which is continuous. This is the tactic adopted by Backus and Gilbert (1968, 
1970). 

Besides assuring that F and g are continuous, which may require reexamining the prior 
geophysical information, the obsen’er must also make sure that X is a Hilbert space with the dot 
product defined by the quadratic form q~ l Q. This is merely a mathematical technicality, and 
requires no new geophysics. To say that the dot product space AT is a Hilbert space is merely to 
say that it is complete in the norm )|x)|; that is, all its Cauchy sequences have limits. This is a 
trivial requirement because if X is not complete it can always be completed (Halmos, 1951). 
Therefore it can be assumed without loss of generality that A' in (4.1), (6.1) is a Hilbert space 
with the dot product defined by q~ l Q . In the present paper, it will be assumed that X is separable 
(Halmos, 1951). This assumption is not essential, but does simplify the notation, since it 
amounts to assuming that all orthonormal bases for A' can be indexed by the integers. In every 
Hilbert space inverse problem known to the author, X is separable. 
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Since X is a Hilbert space and g is a bounded linear functional on X , there is a unique vec- 
tor g g X such that 

g(x) = gox (6.2a) 

for every x g X . Moreover, if || g|| := (g □ g) w , then 

llsIMIgll (6.2b) 

where ||g || is defined as in appendix C: 

||g|| := sup { | g (x) | : x e B x (1) } , (6.2c) 

the ball B x (a) being defined for any real a as 

B x (tx ) := { x : x g X and ||x|| <a } . (6.2d) 

Also, since X and Y are both Hilbert spaces and F :X —>Y is bounded and linear, there is a 
unique tensor Fe7®X (see appendix Q such that 

F(x) = Fnx (6.2e) 

for every x g X . Moreover, 

HF||=||F|| (6.2f) 

where, as in appendix C 

||F|| := sup { ||F(x)|| :xg F x (l)) (6.2g) 

and 

||F|| := sup { |y dF dx| : y g F r (l) and x g F x ( 1) ) . (6.2h) 

Now the inverse problem (4.1), (6.1) can be written as follows: 

p E =p(0,) (6.3a) 

77 G S Y (6.3b) 

c G S z (6.3c) 

y(°) = Fox+77+5y (6.3d) 

z=gcix+£ (6.3e) 
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||x|! < 1 . (6.30 

In the present section, the number m of parameters in the parameter vector 6 is 0, so (6.3a) sim- 
ply means that p E is known. The precise statement of the inverse problem (6.3) is this: the 
observer wants to estimate z from what is known. What is known is p E , S Y , S z , y (0) , F, g, the 
fact that 8y was drawn at random from a population with probability density p E , and the fact that 
there is a triple (x,T7,£) which satisfies (6.3). 

The observer can approximate (6.3) in many different ways by a problem with an injective 
data function, and so is free to choose an approximation which minimizes the error in estimating 
z . Section 7 will describe one technique for minimizing this error. The present section discusses 
how to find all the approximations to (6.3) which have injective data functions. 

To produce any such approximation, let N be a positive integer, and let be any N - 
dimensional subspace of X. It will not be necessary to assume that dimX =«, but that is the 
most interesting case, so the notation will be tailored to it. Define 

X (0 ^:=X N (6.4a) 

and 

X (A'.oo) := X /$~ . (6.4b) 

Since X is a Hilbert space, 

X = X (O^r ) © X as) (6.4c) 

This conventional notation for orthogonal direct sums is described in appendix D, and simply 
» 

abbreviates the following remarks. For every xe X there are unique vectors ^(o^) and 

i£(N,.)S *(*,-) such that 

x = X (0f y ) + X(S> )0e ) . (6.4d) 

These vectors have the additional property that 

x (oa) D X (N ,«) = 0 (6.4e) 
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so that 

l|x|| 2 = l|X( 0 jv)||" + ||X(/v „)|| 2 . (6.4f) 

To approximate F and g , define 

:= F I X (0 //) (6.5a) 

F (Nr ) := F | (6.5b) 

8 ( 0 fr ) := 8 I ^(tyv) (6.5c) 

8(N,~) := 8 l^(W,~)» (6.5d) 


and let the corresponding tensors and vectors be F (0 ^-)G Y ®X^y F^^e Y ® X (A , 
gpjeXp) and g(/v.~) € Now an apparent ambiguity of notation must be resolved. 

There are two ways to construct g (0>A ) and g(N,«.) from g: the construction g Kg in (6.2a) fol- 
lowed by gh(g(ojv).g(w,«)) as in (6.4d); or the construction g H(£ ( o.n>£(a\~)) of (6.5c,d) fol- 
lowed by g(ojv)l-> g(o^) and _„)!-> gyi,-) 35 in (6.2a). Obviously both constructions give the 
same result, so in fact the notation is unambiguous. 

Since F and g are linear, F (x)=F (x^^t^) + = F^^x^//)) + F ^ and 


similarly for g . Therefore 

F ox = F( 0>W ) □ XfQ'H) + F (^ i00 ) □ X( Woo ) (6.6a) 

and 

g □ X = g(O^T) D X(ojv) + g(N ~) d x> . (6.6b) 

Substituting the expressions in (6.3d, e) suggests the following definitions: 

V (n,«) : = F^,.) □ X( W +77 (6.7a) 

C(N,~>) : = g(N.«.) o X(^,«) + C (6.7b) 

Sjt.-) : = F (N .. ) (B X(W -- ) (l)) + 5 y (6.7c) 

Sf A » := £<a» (B^O)) + S 2 (6.7d) 

where 15 X(N, “ ) is defined by replacing X with X i#o) in (6.2d). Note that 

8(F.~)(B X( "~Kl)) = B Z (llg(A'.~)ll ) (6.7e) 
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where B z is defined by (5.8a). The preceding definitions permit (6.3) to be recast in the follow- 
ing form: 


Pe =p(P,) 

e $ (A' ,») 

C (N ,«) e 

y (0) = F(o^) □ *(o;v ) + 1 (N ,«) + 5y 
2 - g(0,N ) D x (0A r ) + ^ (W .->) 

I|X(0^)II ^ 1 . 


(6.8a) 

(6.8b) 

(6.8c) 

(6.8d) 

(6.8c) 

(6.80 


To obtain (6.8) from (6.3), note that (6.8a) is (6.3a) and that (6.8d,e) are obvious consequences of 
(6.3d, e), (6.6), and (6.7a, b). The bound (6.81) follows from (6.31) and (6.41). Assertions (6.8b, c) 
follow from (6.3b, c) once it is shown that F (Noo) dX( A . i0o) (£*(l)) and 

g ( n (!))■ These claims are established by noting that (6.31) and (6.4f) 
imply ||X(N t0e) ||<l. 

The problem (6.8) is formally identical with (6.3), but now the data space is X ( 0 ,n) with 
dimX(P' N )-N <°°, and the data function and prediction functional are F^y.X^^Y and 
g(pji)-X(ojj)-*R’ These are approximations to the original F and g , and the errors of approxi- 
mation, the truncation errors, are included in (6.8b, c). 

Nothing in the foregoing construction of F (0 ^) assures that it will be injective. However, it 
can always be restricted to be so. Let N (0iA >) be the null space of F^y That is, 


N(o^) *•= { x : x e X ^ ) and F^-fii) — 0 } . (6.9a) 

Then (F( 0>A | N(o,n)) y is necessarily injective. To prove this, it suffices to prove that 

F ((W | N ( o^) never assigns 0 to a nonzero vector in its domain (Halmos, 1958). But if x e 
and F (0>A 7)(x)=0, then also x e N so x is orthogonal to itself, and hence x=0. 

Now let 

M = dim (6.9b) 

and in the argument leading from (6.3) to (6.8) use 
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Xm=N(U ( 6 - 9c ) 

instead oiX N . Then A , (0 j^)=X w =N ( oa') and tf> e new F (0iW ) will be an injection ofAT (0>W) into T. 
Thus section 5 becomes applicable. The numerical calculation of and F (Qh1 ^ from 
and Fw ) can 1x3 done 9 uite explicitly, since F (0M) = F (0A >) I A' ((W) = F | For any vector 

xe X ( 0 ^y the vector F (0<N) (x) in Y can be written (F^fx) Ff 0rN) (x)). For any 

/ e {1 D } the real number F[ 0 ^)(x) depends linearly and continuously on xe X^y There- 

fore F{o/j ) : A '(o//)— » F is a continuous linear functional on A r (o/jy Hence there is a unique vec- 
tor F/o^)6 X (0tN) such that 

F , (o//)(x) = F( 0iW )ax (6.9d) 

for every xe X^y The functional F( 0fW) will be called the /th restricted data functional and 
F('o^) will be called the i th restricted data kernel. The space N^) is spanned by the D restricted 
data kernels F ( ’ 0//) F^ y 

One particular choice ofX N will assure from the outset that F^y.X^^Y is injective, 
so that the extra step described in the preceding two paragraphs is unnecessary. Let N F be the 
whole null space of F , that is 

N> := {x : x e X and F(x) = 0 } . (6.10a) 

Writing F(x) = (F 1 (x) F° (x)) defines the D data functionals F‘ : X — and (6.2a) provides 

the corresponding data kernels F‘ e X (Backus & Gilbert, 1967). The space N p is spanned by 
F 1 ,...,F°, so dimN/<D. Take N = dim Kp and 

X N = Np . (6.10b) 

When this X N is used to obtain (6.8) from (6.3), clearly F (0iW ) —» Y will be an injection. 

Choosing X A - =N A avoids the extra step (6.9) in the approximation process for (6.3), and 
has the further advantage that with this choice 

F (NiOo) = 0 (6.10c) 

and (6.7c) becomes 
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S[ A , H :=S y . (6.10d) 

In other words, (6.10b) produces the smallest possible truncation error (namely 0) among all 
choices of subspaces of X. However, (6.10b) has one serious disadvantage: it requires at least 
some numerical computation with D xD full matrices, and D is often very large. For satellite 
magnetic data, D is typically of the order of 10 4 . Another problem is almost certain to appear in 
(6.10) but can also arise in (6.9). The data functionals F 1 , ...,F N in (6.10) or F 1 , ...,F M in (6.9) 
may be linearly dependent or nearly dependent. Then a very small subset of them may carry all 
the information in y (0) which exceeds the noise, tj +Sy. Computations with the whole linearly 
independent set of data functionals will be numerically ill-conditioned and wasteful of computer 
resources. All these questions are addressed in the next section, which shows a way better than 
(6.9c) to choose X M , whether X N is arbitrary or is given by (6.10). 
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7 RESOLUTION IN CONFIDENCE SET INFERENCE 

When dim X =«> and the observer accepts a prior quadratic bound on the correct earth model x, 
the choices (6.9c) and (6.10b) provide two approximations to the linear inverse problem (6.3). 
Each approximation provides a finite-dimensional model space X, an injective data function 
F :X —> Y , and known confining sets S Y and S z for the systematic errors f] and £. Therefore 
both approximations are amenable to treatment by section 5. However, both approximations are 
potentially subject to a serious numerical defect. Even though F~ l :F( X)—>X exists in both 
cases, ||F _1 || may be so large that ||g oF -1 || is too large to make (5.8d,e) or (5.8f) a useful esti- 
mate of z. Since F is linear, all that is required to make F injective is that F(£)*0 for all 
| e 3£ x (l), where 3B X (1):= {* :£e X and ||x|| = l). Nothing prevents F (£) from being very 
small for some such so that ||F -1 || is large. Physically speaking, if £e 3fi x (l) and F{f;) is 
smaller than the error T} +8y, then whatever component £ ox of | is present in the correct earth 
model x, that component will contribute to y (0) a "signal" which is below the noise, because (6.31) 
requires |£ ox | < 1. Then, even though F(£)*0, it is fruitless to try to resolve | ox and use it in 
estimating z . 

In Bayesian inference and stochastic inversion, the computations required to pick out the 
resolvable parts of the earth model x in problem (6.3) are well-known (Jackson, 1979; Gubbins 
and Bloxham, 1985; Backus, 1988a). The present section describes the corresponding calcula- 
tions of resolution for confidence set inference (CSI). 

The first step is to use the prior quadratic bound to approximate (6.3) by a finite- 
dimensional problem (6.8). There is no restriction on the X N used here, and either (6.9c) or 
(6.10b) is acceptable. It will be assumed, however, that 

N = dim X N < dim Y = D . (7.1a) 

A more complicated notation would make (7.1a) avoidable, but the effort is of doubtful value, 
since (7.1a) holds in most practical inversions, and certainly in (6.10b). 
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Resolution in CS1 is based on the eigenstructure or singular value decomposition (SVD) of 
the finite -dimensional approximate data function F^) : A'^)— > Y . The eigenstructure is avail- 
able because both X\ 0tN) and Y are Hilbert spaces; and the eigenstructure is physically useful 
because the dot products on X and Y arise from real physical information about the correct earth 
model X and the random error 8y. Appendix D describes SVD in the form most convenient for 
the present discussion. 

Let the eigenfactors or singular values of F (0 ^) in descending order be 
<f>j >tp 2 > ■ ■ ■ ><p N >0 (7.1b) 

where multiple eigenfactors are repeated according to their multiplicities. Let and 


— »5W ) be orthononmal subsets ofX (0 jv) and Y such that 

F(0, W)(*i)=<PiSi ( 7 - lc > 

for i = 1 N. Since N =dimX {0 ^ ) , {fc lt } is an orthonormal basis foTX ( p /J) and can be 

extended to an orthonormal basis for all of X , written , $ 2 * ' ' ' ) • For each integer i define 

gi : = g □ (7- Id) 

and 

Xi := x □ x, (7.1e) 

so that 

x= (7. If) 

i=i 

and 

g =£*,*,•• (7-lg) 

1=1 


Here x and g are the earth model and prediction functional appearing in (6.3). 

Calculation of resolution is facilitated by the following notation. For any integers p and q 

such that 0<p <q , define as the subspace of X spanned by [il p+ Define 

X(p^)= (0), and let be the subspace of X spanned by {x /)+1 ,^ / , +2 , • • • }. For the special 
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cases X (0 > n) and X (N>oc) these definitions agree with those already introduced in (6.4). If 

0<p <q define 


r (p,q) r 1 ^(p.q)' 

(7.2a) 

8(p,q) ' = $ 1 ^ (p,q ) . 

(7.2b) 

r (p,q) F(X(p,q)) < 

(7.2c) 

Px (pq) ’ 

(7 .2d) 

the orthogonal projector of X onto i<?) ; 


P Y (p,q) : = 

(7.2e) 

the orthogonal projector of Y onto Y {p, q y 


x (p,q ) ,= ^(p.?) ( x ) • 

(7.2f) 

and 



(7.2g) 

This notation leads to a number of useful identities. 

some of which are expressed most 

easily in the tensor language of appendices C and D. If 


0 <p £q ^ 

(7.3a) 

then 


? 

X (p ,?) = £ x i x i 

(7.3b) 

i=p+ 1 


<7 

6( P*q)~ S Si^i 

(7.3c) 

i=p+ 1 


and 


<x 

<x 

ll 

Ch 

(7.3d) 

/=/?+! 



where a sum is understood to vanish if its upper limit is less than its lower limit. If 

0 < p < # < N (7.4a) 

then 
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P£, ?) = £ fcfc (7.4b) 

l=/?+l 

and 

Fo,. 9) = £ */&*.• • (7.4c) 

/=/>+! 

If 0 <p < <7 and q is such that 

<t> q > 0 (7.5a) 

then (7.1b) implies that F is a bijection whose inverse F^ q ) is represented 

by the tensor 

F^)= £ (pf'SiSi • (7.5b) 

/ =p+l 

Here has as its domain Y^^y but F^) can be interpreted as a member of either 

X(p,q)®Y or X ®Y . Equation (7.5b) is simply (D.lle) in appendix D. It is also important 
to observe that (7.5b), (7.4b) and the orthonormality of {^i Ss ) imply 

P(p. 9 ) ^ P(p>?) = P"(p.?) • (7.5c) 

If n is any integer satisfying 

0<n <N , (7.6a) 

then in the standard notation described in appendix D 

X( 0 ,N) = ^( 0 ,n)®^(nJV) (7.6b) 

X(n ,«c) =r ^’(n/.') ® ^(N,») (7.6c) 

(7.6d) 

Now' the idea is to approximate (6.3) by each of the X ( 0i „) for n = 1 N and to try to select 

n to minimize \Kfos,)(P )\ • (be radius of the confidence interval with failure ratep for the predic- 
tion z . For each n in (7.6a), definitions (6.7a, b) become 

V (n ,~) : = F („ □ X ( „ + 77 (7 . 6 e) 

C( n ~) : =g(».~) DX (n.~) + C • (7.60 
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If n is small enough so that 

</>„> 0 (7.7a) 

then / r (o,n):X(o.n)-» Y is an injection, and section 5 can be applied verbatim to the linear, finite- 
dimensional injective inverse problem whose terms in (4.1) are X :=A' (0/l) , F:=F (0n) , 
8 := 8 ( 0 ,n )• x :=x ( 0 , n ). *7 : =*7<n,-). and £=£(«,-)• T he solution, (5.3d) or (5.8e), can be slightly 
simplified because now X (0in ) as well as Y has a dot product, and tensors are available. Using the 


tensor notation, (5.3c,e) become 

0-7b) 

and 

*©r=6<o,,)°*g>. t7.7c) 

Using (7.5c) makes it possible to write the solution (5.3d) in the form 

2 =2 (o!n) + C(n,~)“ 8(0,n ) ° ) □ (7J („ „) + Sy) . (7.7d) 

This last equation can be slightly simplified. Clearly F(n,m) = F{n//) + F&i,~) and. from 
(7.4c) and (7.5b), 

F(o!n) ° F (n ^-) = 0 , (7.8a) 

so 

F(o!n) o F(„ = F ( o|„) d F (N ^ . (7.8b) 

Therefore, from definition (7.6e) applied to n and N 

F(o!«) OTJ (« ,~) = F(o|b ) □ *? (A\~) (7.8c) 

and (7.7d) can be replaced by 

2 =z (S?fi) + C(f! ~) “ 8 ( 0 ^i ) D F ( o| n ) □ ( 77 ( A , i00) + 8y ) . (7.8d) 


For the particular choice X N =N/?, i.e., (6.10b), clearly F (A . iOo) = 0, so in (7.8d) rj(N ,~) can be 
replaced by T], the original systematic error in (4.1b) and (4.1d). 
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For anyp € (0, 1) a confidence interval K 'f 0 ,«)(p) can be written down from (7.8d) such that 


z g Kf 0 , n) (p) (7-9a) 

with failure ratep. Here z is the correct value of the prediction. To construct K^ n) (p), first 
define 

Y(0.n ) : = 8(0 ,n ) □ F (0.n ) ' 9b ) 

and 

f(o. n ) : =r ( o.n)/llr(o. n )l! • ( 7 - 9c ) 

Let 

v„(^) = v(y ( o^),p) ( 7 -9d) 

as calculated from p E via (5.7). Then (5.8c) and (7.5c) permit the assertion that, with failure rate 

P. 

1 g( 0 .«) □ F^) o Sy | < Hr(o.n)ll V «<P) ■ ( ? - 9e ) 

For any set Q c Y and anyy g Y , define 

r°Q : = {r°y ; y e Q ) • 

Then (7.8d) implies that with failure rate <p 

2 6 2 (0/») _ 7(0.n) D S(N.») +^ Z (lly(0^)ll v/ n( D )) (7.9f) 


where v„(p ) is given by (7.9d), y (Qn) by (7.9b), and B z by (5.8a). Using (6.7d,e) to evaluate 
Sf n<00 ) shows that the set on the right of (7.9f) is 

Kfo„)(P) := 2 (0.1) +S Z +fi Z (lig(„,»)ll + (p ) lly ( o,« >11 ) - 7<o„ ) □ S [„ . (7.9g) 

Thus, when Kf 0A) (p) is defined by (7.9g), (7.9a) holds with failure rate <p. Usually, S z and 
Sj N<co) are symmetrical about the origin, and then (7.9a) and (7.9g) together are equivalent to the 
assertion that, with failure rate <p , 

I z-z^ n) | < |5 Z | +||g (B .»)|| +v(n,p) ||r ( o.„)ll + ly ( o.n ) ci5f N . ) | . (7.9h) 
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The foregoing calculation can be repeated for any n satisfying (7.7a), and the idea is to 
choose n to minimize the right side of (7.9h). Therefore it is necessary to understand how the 
terms in (7.9h) vary with n . In the interest of simplicity it will be assumed that 


(7-lOa) 

the ball defined by (5.9b). Then, by (7.8c), 

|y ( o,„ ) D5f B ,. ) |=||r ( o^)ll/3 (7.10b) 

so (7.9h) becomes 

\z-z$ n) \<\S z \+T p (n) (7.11a) 

where 

T p {n) := ||g ( „.«)|| +l!y (0 ^)ll *„(p) (7- lib) 

and 

K„(p):=p+v(n,p). (7.11c) 


Let M be the largest n such that <p„ > 0. The value of T p (n ) can be calculated for 
n =0, 1, , . . . , M , and the minimum can be chosen by inspection. 

Why might the observer expect to find values of n e {0, ...,M ) for which T p (n) is substan- 
tially less then T p { 0) or T p {M)l A rigorous general proof seems unattainable, but a suggestive 
discussion can be given. From (7.11b), 7'p(0) = |lg(o~)ll = llgll- In this case, the estimate of z in 
(7.11a) comes entirely from the prior information (6.30- The data vector y^ contributes nothing. 
If the data are at all relevant to the prediction, then T p (n) should decrease initially as n increases 
from 0 and the data begin to contribute to the estimate of z. But why should T p (n ) start to 
increase again before n reaches M ? 

The reason lies in a characteristic property of most non-trivial linear inverse problems. In 
an ideal problem, all possible data could be collected without error, and would determine a 
unique earth model x. In this ideal case, both the model space X and the data space Y are infinite 
dimensional Hilbert spaces, and the data function F :X ->Y is a linear injection. In the non- 
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trivial ideal problems, however, F is compact (Kato, 1976), so that — >0 as n — >«>. In a real 

inverse problem with dim Y < but with Y large enough to give some hope of approximating 
the ideal problem and obtaining useful limits on the prediction, the observer should expect to find 

<p n « 1 (7.12a) 

as n increases toward M, and indeed this is commonly observed (Gubbins and Bloxham, 1985; 
Langel and Estes, private communication). 

To see more clearly how T p (n ) will vary with n, introduce the abbreviations 
R(n):= || g (n H || 2 and 5 (n ) := ||y ( o„ )ll 2 - Then from (7.3c) 

R(n)= £ g 2 (7.12b) 

/ =/I +1 

and from (7.4c), (7.5b) and (7.10b) 

S(*)=2>fV. (7.12c) 

;-i 


Furthermore, 

T p (n) = R(n)' A + S(n)'\ n (p). (7.12d) 

Define 


D p (n) := g- 2 [T p (n-})-T p (n )] . (7.13a) 

Then, if the data are relevant, one can expect D p (n)>0 for small n. If T p (n) does have a 
minimum, then eventually D p (n ) will become negative. But a simple calculation gives 

D p (n ) = [R(n -if + R(n f) -<pfx n ip) [S in - 1 f + S(n f ] . (7.13b) 


Thus D p {n) > 0 exactly as long as 

r,,(p)[f?(n- lf+Rjn)^ 
0„[S(n-l)*+S(n)*] 


(7.13c) 


Clearly, if <M , 
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R (fl -1) 1/4 + R(n)' A >2 1| g(A/,~)ll 
while (7.ib) assures that 
0J5(«-l) w + 5(/ l )^<2||g (O>W) ||. 

Except for distributions (5.6a) with pathologically long tails (e.g., a superposition of two Gaus- 
sians with quite different variances) it will happen that v(y (0n yp)> 1 for the interesting values of 
p (see Table 1). Then K n (p)>\, and the right side of (7.13c) will always be at least 
II £(\f ,oo)ll I llg(o^)ll- When n becomes so large that 

0»<llg(M.~)l|/||g(O,M)ll (7.13d) 

then D p {n)< 0, and T p (n ) will increase with n . The minimum value of T p (n ) will certainly have 
been passed. 

The foregoing calculation of a confidence interval for z makes no explicit attempt to 
enforce the constraint (6.1), and the n which minimizes T p (n) may not satisfy 

||x (0<n) ||<l. (7.14a) 

As a result, the confidence interval K z (p) is conservatively long, and could be shortened by 
enforcing (7.14a) as a further constraint. It is not clear whether the required calculations are 
worth doing. In practise, q in (1.4a) will not be known with any precision, and therefore the CSI 
estimate of | K 1 (p ) | will not inspire much confidence unless 

llx (0 , n) ||«l (7.14b) 

at the value of n which minimizes T p (n ). The cautious observer will compute ||x (0n )|| and 

check (7.14b) as an essential part of CSI. For values of n near M , even (7.14a) is likely to fail, 
but such values of n are of no interest in estimating z . 
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8 ESTIMATING THE PROBABILITY DISTRIBUTION OF THE RANDOM ERRORS 

In sections 5, 6, and 7 it was assumed that the probability distribution p E for the random error 
vector 8y is Known. In fact, under certain conditions p E can be estimated from the data vector 
y (0) . The conditions arc that p E be known to belong to a given m -parameter family of distribu- 
tions p(0, •); that there be a model space X such that F :X -+Y is an injection; that (4.3) holds; 
and that |S y | « 1. These conditions must often be verified a posteriori, because |S r | depends 
on p E , that is, on the unknown value of 0. The optimal choice X =X (0 „) in section 7 will also 
depend on this unknown 0. 

The recursive character of the foregoing statement of conditions under which p E can be 
estimated from the data is one result of the nonlinearity of the problem of estimating 0. It is 
difficult to make general statements about nonlinear inverse problems, but one class of problems 
admits a general solution by the method of maximum likelihood. This is the class in which all 
the probability distributions p(0, •) in the given family have density functions f (6, ): Y —>R. 
That is, for any Lebesgue measurable subset Q of Y, 

p(0,Q)= jf(6,y)dy 1 ■ ■ ■ dy D , (8.1a) 

Q 

a result usually abbreviated as 

dp ( 6 , y) =/ (6,y)dy ] • • • dy D . (8.1b) 

Then the probability distributions (4.2e) all have densities given by 

dp(e,y)=f(0,y-F(x))d > ’ l • • • dy D , (8.1c) 

where 0 = (6 , x). The method of maximum likelihood estimates 0 and x from y^ by choosing 
0(°) and x<°> to maximize / (0,y^—F(x)). The estimate is useful when 

A = dim Y - dim X - m (8.1d) 

satisfies X » 1. The A in (8. Id) is called the number of degrees of freedom in the inverse prob- 
lem. If / (0,y<°)-F(x)) does have a unique maximum (0 (O) ,x (O) ), then and x (0) are random 


September 19, 1988 



George E. Backus 


Confidence Set Inference A1 


variables. Under mild conditions on/, the following facts are known (Cramer, 1946, ch. 33). 
Suppose that X » 1 and that for each 0 € 0, p (0, •) makes 8y \, .... Sy D independent; or alterna- 
tively, suppose that all the p(0,) are Gaussian. Non-diagonal correlation matrices £[(5y, )(«5y ; )] 
are permitted in the Gaussian case. The random variables 0 (O) and x (0) are approximately Gaus- 
sian, with variance matrices of order X~ l and means correct to order X~' A . From these facts, 
confidence sets K y (p,6 ) can be constructed as in (3.3a), using (4.2b). The author has not yet 
carried out this program for any m > 1 , but believes that the outcome will be the same as in the 
case m = 1 to be described below. That is, for failure rates p which are not too small in a sense 
determined by the A of (8. Id) (see (8.5a)), there will be a positive p\ «p and a confidence set 
K e (p , y (0) ) with two crucial properties: (1) the true 0 will be a member of K e (p , y (0) ) with failure 
rate p j; (2) all the p (0, •) with 0 e K e (p, y (0) ) will be nearly the same. Then p E can be taken to 
be any one of the p (0, •) with 0 € K Q (p, y^), and this known p E can be used in sections 5, 6, and 
7 with failure ratep —p \ to produce a confidence set for z with failure rat ep. 

This phenomenon is visible in one common formulation of the geomagnetic modelling 
problem, where m = 1. Here the data function F : X — » Y and the prediction functional g :X —>R 
are linear, and the probability distribution p E is believed to belong to the one-parameter family 
p (0, •), all of whose members are Gaussian on Y with mean 0 and with variance matrices which 
are 0 2 times the known variance matrix of p (1, •)• This latter need not be diagonal. The usual 
statement of this problem is that the unknown parameter 0 is to be estimated along with the earth 
model x (0) . In CSI, estimating 0 and z are separate problems, and neither requires the other, 
although the scheme presented here begins by producing an estimate for 0 whose failure rate is 
much less than the failure ratep desired for z . 

To begin the estimates, note that if Q is any Lebesgue-measurable subset of Y then 

p{e,Q)=p(.\,e~'Q). (8.2a) 

If h : Y ->R, then E e [h (<$y)] will denote the expected value of h (<5y) under p (0, •). That is, 
£ e [/i(<5y)):=|rfp(0.y)^(jO. (8.2b) 

Y 
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Then 

£0[<5>’;] = ° (8.2c) 

and, because of (8.2a), 

E e [(Sy,)(S yj )] =e 2 E ] [(5y i )(5y j )] . (8.2d) 

Let (yoy) e denote the dot product defined on Y by p(6, •) as in appendix B. Then (8.2d) makes 
clear that 

(y ojOe =0~ 2 (yoy)i (8.2e) 

and, in an obvious notation, 

lly|le=0 -1 llylli (8-20 

In particular, if p (1, •) makes 5^ a unit vector in Y , then p(6,-) will make 6$ i a unit vector, which 
prompts the abbreviation 

$e=e$ l (8.2g) 

The advantage of (8.2e) is that the notion of orthogonality in Y does not depend on 0. 
Therefore the orthogonal projectors of Y onto F(X) alnd FfX) 1 are independent of 0. All of sec- 
tions 5, 6, and 7 can be carried out for 0 = 1, and the corresponding results for any 0 can be 
obtained by using (8.2e) to introduce appropriate powers of 6 in the various terms in sections 5, 6, 
and 7. For example, if 0, (0) is the /th eigenfactor of F :X — in the sequence (7.1b) when 
p E =p(6, •), then 

0i(0)=0"V/( 1). (8-2h) 

For any positive 0, let s(9) 2 denote the observed value of the random variable (5.12a) calcu- 
lated under the assumption that p E =p(6, •). Then (8.2e) implies 

s(0) 2 =0' 2 j( l) 2 (8.3a) 


and (5.12b) becomes 

|0 - 2 j( 1) 2 -1 | <v(p,)2»(D -AT* , 
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If p E really is p (6, •), then (8.3b) is true with failure ratepi. Then so are all its consequences, 
including the assertion that 

8 < s (1)[1 -v(p ,)2 \D . (8.3c) 

If p E -p (8, •), then (5.7d) becomes 

l(y e oSjOel <v(fi 2 ) (8.4a) 

where v(') is the inverse of the functionp(-) defined by (5.1 1), and fe is as in (8.2g). Then (8.2e) 
and (8.2g) show that (8.4a) is equivalent to 

l(fi o£y)i | <0v(pf) . (8.4b) 

Since (8.4a) holds at failure ratep 2 , so does (8.4b). 

Confidence sets can be combined by the rules given in the paragraph following (3.3f). 
According to those rules, at a failure rate no larger than 

P=Pi+P 2 (8.4c) 

both (8.4b) and (8.3c) are true. Therefore, at a failure rate no greater thanp , 

I o <5y)i I <v(p 2 )i (1)[1 -v(p i)2 V -NT' a T' a . (8.4d) 

Ifp is fixed, (8.4d) will be satisfied with failure rate <p for any choice of p x and p 2 satisfying 
(8.4c), so p ] and p 2 can be chosen to minimize the right side of (8.4d), and then (8.4d) can be 
used as in section 5 to obtain K z (p), a confidence set for the prediction z at failure ratep. 

In fact, unless (D -N)' A » 1, the v(p j) in (8.4d) really should be calculated from (5.10), 
the Gaussian density being replaced by the density appropriate to the chi-square distribution with 
D -N degrees of freedom. Here it will be assumed that (D -N)' A » 1. Then, if the observer is 
willing to accept a failure ratep large enough to permit 

v(p) « (D -N)' a , (8.5a) 

it will be possible to choose pj so that 

Pi «p (8.5b) 
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and yet 

v(p x )«{D-N )' A . (8.5c) 

With this choice ofpj, 

p ~p 2 (8.5d) 

and with an error of order v(p j) (D (8.4d) becomes 

| (y, □ 5y)j | <v(p)s(l). (8.5c) 

For any 6 satisfying (8.3b), (8.5e) is equivalent to 

I (ft □$’),! <v(p)0 (8-5f) 

with an error of order v(p X )(D By (8.2e) and (8.2g), (8.51) is the same as 

I <&> o <5y)e I <v(p). (8.5g) 


Therefore, except for a fractional inaccuracy of order v(p x )(D -N) v % the statement (8.5g) is 
correct with failure ratep if 6 is any parameter value which satisfies (8.3b). 

In other words, if the observer will accept a failure ratep which satisfies (8.5a), then he or 
she can estimate 8 by choosing any value which satisfies (8.3b), and can use p£ =p (0, •) to define 
a dot product on f. All of sections 5, 6, and 7 can then be invoked with the p E , and the 
confidence sets thus calculated will be in error by factors of the order of 1 +v(p j)(D -A 7 ) -1/i . 

Certainly there are interesting problems where dim © > 1. For example, the distribution of 
data errors may be the sum of two Gaussians, one with much smaller amplitude and larger vari- 
ance then the other, so that there are outliers in the data. Alternatively, the mean (5y) may not be 
known to vanish. If nothing is known about (<5y), of course the data are useless. Often, however, 
the unknown mean can be parametrized by a family of distributions for which dim © « dim Y , 
and then the mean can be estimated from the data. This is probably true, for example, of the orbit 
errors in satellite geomagnetism, although the method has not been tried there. When dim © > 1 , 
the estimation of 6 is more complicated then the one-dimensional illustration considered in this 
section. For example, the dot product in Y will depend on 6 . Further work is required to discuss 
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definitively the general case, that dim © > 1. 


9 THE GEOMAGNETIC EXAMPLE 

As an application of confidence set inference, consider the inverse problem described in section 
2: to estimate the geomagnetic field B at the CMB from measurements of Cartesian components 
of B at and above the earth’s surface. 

The infinite-dimensional model space X consists of all the magnetic fields (2.1) outside the 
CMB whose sources are inside the core. Only prior quadratic bounds of the form (2.5) will be 
considered, so the dot product in X of the two fields B and B with Gauss coefficients/? /"(a) and 
is 

BnB :=^" 1 £c(/) £ /?/ m (a)/3r(a). (9.1) 

/-I m— / 

To study whether the data function F :X -» Y and prediction functional g :X ->R are con- 
tinuous under (9.1) it will be very helpful to have the particular orthonormal basis for X gen- 
erated by the individual spherical harmonics Y{". Let £/" denote the model field B all of whose 
Gauss coefficients vanish with the one exception that 

fir<fi) = q»CVf*. (9.2a) 

Then (9.1) evidently implies that 

Sm- (9.2b) 

By (2.1), if B □x ; m =0 for all l and m then B = 0. Hence the collection of all the £/" is an ortho- 
normal basis for X in the sense of Hilbert space. For any model B with Gauss coefficients /? j m (a), 
(9. 1) and (9.2a) imply that 

nr a B = q~' A C ( / ) M /?f (a) , (9.2c) 

SO 

B = q-' h £ C ( / )* X fiP(a ) x, m , (9.26) 

1 = 1 m=-/ 
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the series being convergent in the norm defined by (9.1) (Halmos, 1951). 

To study the continuity of the data function and prediction functional it will also be helpful 
to introduce the definitions 

E L := {£/" : 1 < / < L + 1 and -/ < m < / ) (9.3a) 

and 

X [L] := sp Z L . (9.3b) 

Here sp E L is the "span" of E L , the set of all finite linear combinations of members of Z L . The 
set X|^j is a subspace of X , and if L < «> then dim = L (L + 2). The subspace X M consists of 
all the series (9 .2d) which have only finitely many nonzero terms, and X is the closure of X [„] in 
the norm generated by (9.1). 

Now suppose a linear functional g :X —>R is given. Is it continuous? To find out, define 
gr>=q-' A C{lj A g(xr). (9.4a) 

If B e X\ L } then evidently 

S(B)=£ £ grPi m (a) (9.4b) 

/=] m®-/ 

where ft F(a ) are the Gauss coefficients of B. By the Riesz-Fisher representation theorem, 
(Lorch, 1962, p.63), the linear functional g is continuous iff there is a ge X such that for every 
Be A' 

g (B) = g o B . (9.4c) 

From (9.4a), it then follows that g (x/") = g □ x/", so 

g=<?*£c(/r» £ gnr- (9.46) 

/= 1 m =-/ 

Therefore 

llgll 2 = «? ZC(ir l £ (gD 2 - (9-4e) 

/= 1 m=-l 
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IfgeX, ||g||<~, so g is continuous iff the series (9.4e) converges. The coefficients g” required 
for this test can be obtained from either (9.4a) or (9.4b). If g is continuous, then (9.4b) holds for 
all B e X ; that is 

g( b)=£ s grpr(a)> (9-4i) 

l =1 m *=>-/ 

the series being absolutely convergent for all B e X . 

It is of great interest to try to use the data to reconstruct B r , the radial component of B, on 
the CMB. To do so, consider the various functionals g \ X —>R which give weighted averages of 
B r over the CMB, with various weighting kernels. It will suffice to consider axisymmctric ker- 
nels. To describe these requires more notation. If c e R , define 

S(c) := {r : r e R 3 and |r|=c}; (9.5a) 

then S(c) is the surface of the three-dimensional sphere of radius c centered on 0. If 
/ : S (1) — » R , define the average of / over 5 ( 1) by 

(f(f)> f :=(4^r)- 1 J d 2 tf{t) (9.5b) 

5(D 

where d 2 t is the element of surface area on 5(1). Every function G : [-1, 1] — >R produces an 
axisymmetric weighting kernel g(§) at each point ase S(a). The linear functional g(s):X — » R 
is defined by requiring for each B e X that 

g(§) □ B := (G (f • s)£ r (a f)) f . (9.5c) 

The functional g (§) has coefficients #/"(§) as defined by (9.4f). To find them, write 

G<ji)=ZG,P, Gu) (9.5d) 

/=i 

where P t (jj.) is the Legendre polynomial of degree / and 

t 

G, = (2/+1) J dji G(jj.)P,(ji) . (9.5e) 

-l 

The addition theorem for real scalar spherical harmonics (Erdelyi et al., 1953) permits (9.5d) with 
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jj. =f * S to be written 

G (f • S) = £ G, £ 17(f) TO (9.50 

/ =0 m =— / 

(recall that T/" are real). Then, from (9.5c), 

g(3) □ B = £ G, £ y ; m (s) (Y, m (f)B r (af)) r . (9.5g) 

7 =0 m =— 0 

Because of the Schmidt normalization of T/", (2.1) gives 

<y/"(f)£ r (a f)) f = (/+l)(2/+ir , /3 / m (fl ) • (9.5h) 

Comparing (9.5g) with (9.4f) leads to 

to = (/+D( 2/+ir i G / rrm . (9.5i) 

Two useful consequences of (9.5i) should be noted. By the addition theorem for scalar spherical 
harmonics, 

i \gr^)\ 2 =(M) 2 ai+ir 2 G, 2 ( 9 . 5 j) 

m =>— / 

and so (9.4e) becomes 

llg($)ll 2 = q £ (/+1) 2 (2/+1)- 2 C ( l r 1 G, 2 . (9.5k) 

/= 1 

As pointed out in section 6, CSI will fail if the prediction functional g \X -+R is discon- 
tinuous under (9.1). The weighted averaging functional g(s) defined by (9.5c) is continuous iff 
(9.5k) converges. If (9.5k) diverges, then the prior quadratic bound (2.5) is too weak to overcome 
the non-uniqueness of the prediction z in (4.1e) produced by the fact that dimX =«>. Then the 
particular average (9.5c) cannot be estimated from surface and satellite data with only the prior 
information (2.5). One special case of (9.5c) is obtained by putting the Dirac delta 

function, so G t =21+1. In this case, g(s)nB=B r (as) (Backus, 1986). It follows that the values 
of B r at single points a § on the CMB can be estimated from surface and satellite data iff (if and 
only if) the prior quadratic bound (2.5) satisfies 
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Z(l+l?C(l)- l < (9.6) 

/= 1 

For the energy bound (2.3), C(l ) = (/+l)(2/+l) -1 , and for the heat-flow bound (2.4), 
C(/ ) = (/+l)(2/+l)(2/+3)/ _l . For both these bounds, (9.6) diverges, so neither bound is strong 
enough to permit estimation of B r at the CMB from surface and satellite measurements of B. 
The best one can expect with those bounds is to obtain localized averages of B r . 

One such localized average is the unweighted average of B r over a disc of radius a radians, 
or aa km, centered on the point a § € S (a ). For this average, 

GQz) = (l-cosa) _l (9.7) 

if cos a <ti < 1 and G(jj.)=0 otherwise. When / » 1, Pj(cos 6 ) is asymptotically the Bessel 
function J Q [ 6 (l+ l / 2 )], so (9.5e) becomes (Abramowitz & Stegun, 1964, pp.336 and 361) 

G, =2a(l -cosa) -1 J^ail+Vi)} 

~ (1 - cosa) -1 (&ra// ) w cos [a(l + Vi) - 3tr/4] 

as / — ► Then (9.5k) converges for the heat-flow bound (2.4) but not for the energy bound (2.3). 

The heat-flow bound is strong enough to permit estimating uniform averages of B r over circular 
patches on the CMB, but the energy bound is not. If the energy bound is to be used to estimate a 
weighted average of B r over circular patches, the coefficients G t in (9.5d) must approach 0 faster 
than l~' A as / approaches infinity, so the kernel function G (ji) must be smoother than the boxcar 
(9.7). For example, G (ji) might be a tent function, so that G t behaves like r 3/2 as / -* «>. 

Now consider the data space Y . For simplicity it will be supposed that the data (2.2) consist 
of all three Cartesian components of B measured at D /3 locations r ; on and above S ( b ), the sur- 
face of the earth. The random errors 5y i of the individual components will be assumed to be 
Gaussian, independent, and identically distributed with mean 0 and variance 6 2 . Then for any 
data vectors y and y, their dot product is 

yiay = 0" 2 Xj, Si ( 9 - 8a ) 

i=i 
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or 


y □ y = 9~ 2 £ B(r, ) • B(r t - ) 


D/3 


(9.8b) 


i=i 


where B(r, ) and B(r,) are the two measured vector magnetic fields at the location r, . The dot 
product on the right in (9.8b) is the ordinary vector dot product in R 3 . 

The data function F :X —>Y is very simple in the geomagnetic example. If y=F(B) then 
the D components of y are the D Cartesian components of B(r) at the D /3 locations r f . In par- 
ticular, (9.8b) implies that for two models B and B e X, 

D/3 


F(B)uF(B)=0~ 2 ZB(r,)-B( ri ). 

i= l 


(9.9a) 


Therefore, 


||F(B)|| 2 =0- 2 |B(r,)| 2 . 

i= 1 


(9.9b) 


To use CSI, the observer must verify that the data function F :X -» Y is bounded (continu- 
ous) in the norms on X and Y defined by (9.1) and (9.8a). To prove this, note that for any r > a , 
(2.1) implies 


B (rf)= £ {air )' +2 £ p?{a) bftt) 

/= 1 m=~l 


(9.9c) 


where 


b/"(f) := -V [r w-1 y / m (f)] rml . 

Furthennore, by the addition theorem for vector spherical harmonics (Backus, 1988a, Appendix 
A), if f e S (1) then 


i 


I |b/"(f)| 2 = (/+l)(2/+l). 

m=-/ 


Therefore, by Schwarz’s inequality, 


(9.9d) 


Z /5/ m («)b/ m (f)| <(/+l)(2/+l) X I prw 

m =>— / m =>-/ 
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and hence, using Schwarz’s inequality once more, 

| B(rf) | 2 < q ||B|| 2 £ (/+1)(2/+1)C(/ )~\a/r) 2M (9.9c) 

/ =1 

where ||B|| 2 = BdB as defined by (9.1). If r >a and C(l ) is any rational function of /, then 
(9.9e) converges, so the linear function Bh-» B(r f) with domain X and codomain R 3 is bounded 
for any fixed rt with r > a . Then the data function F : X -> Y is also bounded, and in fact (9.9e) 
implies that 

\\F\\ 2 <(qD/3y9~ 2 £ (/+1)(2/+1)C(/ T\alcf^ (9.9f) 

/=i 

as long as | r 4 - 1 £ c > a for all observation points r, . In particular, ||/ r || <« and F is continuous 
when the prior quadratic bound is either the energy bound or the heat-flow bound. Both these 
bounds permit the use of CSI as long as the prediction functional is bounded. 

To cast the geomagnetic problem in the form (4.1), it remains to produce sets S Y c Y and 
S z qR such that if 77 and £ are the systematic errors in modelling the data and the prediction 
then the observer is confident that ij e S Y and £ € S z . For simplicity, it will be assumed here 
that the prediction z in (4.1e) is a numerical property of the magnetic field produced outside the 
core by currents in the core. Nothing else is involved in 2 . In that case, z is exactl y calculable 
from the correct model x, so £ = 0 and 

S z = { 0 } . (9.10a) 

The specification of S Y is more difficult, since tj includes contributions to the data from the man- 
tle, crust, ionosphere and magnetosphere. To treat the crustal contributions, one can model them 
(Langel, Estes, and Meade, 1982), but here they will be controlled by assuming that all the data 
are collected from satellites on or above the spherical surface S (c), which lies at an altitude c -b 
above S (b), the surface of the earth. Several workers (Lowes, 1974; Langel and Estes, 1982; 
Cain, Wang, and Schmitz, 1988) have used the satellite data to estimate the spatial power spec- 
trum of the geomagnetic field, as defined by Mauersberger (1956), Lucke (1957) and Lowes 
(1974). They find the spectrum’s dependence on spherical harmonic degree / to be well fitted by 
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a sum of two terms representing spatially white sources slightly below S(a ) and slightly below 
S(b). If the second tetm is interpreted as that produced by the magnetization of the crust, then 
the satellite data provide a bound on Tj c , the contribution to tj from the crust. At altitude 
c-b =400 km, the parameters of fit obtained by Cain et al. (1988) give the rms total intensity of 
the crustal field to be about 

u = 12 nT . (9.10b) 

If B is measured at D/3 points on or above S (c), then 

\\r] c \\<{uie)(pnj A . (9.10c) 

Estimates of the contributions to T] from the mantle, ionosphere and magnetosphere are 
more difficult, and will be neglected here. In practice, ionospheric and magnetosphcric effects 
have been minimized by selectively discarding data particularly sensitive to these effects. For 
example, in modelling the MAGSAT data, Langel and Estes (1985) used vector data only at 
geomagnetic latitudes below 50°. In the polar caps at higher latitudes they used only total inten- 
sity, which is less seriously affected than are the horizontal components of B by the field-aligned 
currents in the magnetosphere. Using total intensity makes the data function F :X ->Y non- 
linear, but it is nearly linear because above 50° geomagnetic latitude the radial component of B 
accounts for at least ninety percent of the total intensity. If only the crust contributes 
significantly to the systematic error, then (9. 1 0c) implies that one can take 

S y =B y (J3) (9.10d) 

where B Y is the ball defined by (5.9b), and 

/3=(u/0)(D/3) w . (9.10e) 

The foregoing discussion casts the geomagnetic problem in the form (4.1) with the prior 
bound (6.1). Now section 6 can be used to estimate the truncation errors produced when X is 
replaced by an N -dimensional subspace X N , so as to permit numerical calculations. Of course 
the answer depends on how X ^ is chosen. One choice (Backus and Gilbert, 1968) is to follow 
(6.10b) and take X N =N p. Then X N = sp (Fj, ...,F D }, F- being the data kernels defined by 
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(1.2a). This choice leads to no truncation error, so 

(9.1 Of) 

but it has the disadvantage that N is likely to be D or nearly so. Parker and Shure (1982) chose 
X N to be the span of a subset of {Fi,...,F £) ), chosen so that N «£> but with the observation 
points sufficiently well-distributed to make the truncation error small. Parker and Shure used 
numerical evidence rather than a rigorous theory to estimate the truncation error. Still another 
X N is obtained by triangulating S ( a ), giving B r at the nodes (vertices), and interpolating B r into 
each triangle from the vertices (R. L. Parker, private communication). The oldest choice of X N in 
geomagnetism, and the only one to be examined in the present paper, is truncation of (2.1) at 
some degree L. Thus X N will be one of the spaces X^ defined by (9.3b), so that N =L(L+ 2). 
The space A (A r >oc )=A^j will be written A' [L1 , and the functions :A' [L] -> T and 
p\L) . » Y will be defined by 

F lL] :=F \X [L] , (9.11a) 

F [L] := F \ X lL] . (9.11b) 

Thus and F [L j are the A' (0jN) and F (0i /v) of (6.5a), while A' [L) and F [L] are the X (N ^ and 
F>.~) of (6.5b). 

To calculate the value of ||/ r (A 7 _ 00 )|| needed in (6.7c), note that if B e then (9.9e) remains 
true when the lower limit of the sum over / is L + 1 instead of 1. Therefore (9.9f) becomes 

H-F^ll 2 < (qD/3)9~ 2 £ (l+l)(2l+l)CUr\a/cf^ . (9.11c) 

/=!+ 1 

Backus (1988a, appendix A) sums this infinite series when C{1 ) comes from the energy bound 
(2.3), and provides an upper limit when C ( / ) comes from the heat flow bound (2.4). The latter 
implies that for the heat flow bound 

\\F [L \<e-\qDI6)'\alc) L +' i [\-{alc) 2 r' /l . (9.11d) 

(In equation (A.16) and the inequality preceding (A.15) of Backus, 1988a, the factor 1+2 should 
be 2/+1. This misprint does not affect the subsequent calculations.) 
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Nothing is gained by taking L so large that the truncation error (9.1 Id) is several orders of 
magnitude smaller than the crustal systematic error (9.10c). The truncation error is less than ten 
percent of the crustal error when 

(cla) L *>(\Qlu){ql2)' A [\-(alc?T ' /l . (9.11c) 

With u = 12 nT, q = 3 x 10 17 nT : and c = 677 1 km, (9. 1 le) leads to 

LS27. (9.1 If) 

It remains to carry out the analysis of resolution described in section 7. For this purpose, it 
is necessary to find the eigcnstructure (singular value decomposition) of E (L] » Y, as 
described in appendix D. In general, this will require numerical calculations based on the actual 
locations of the D/3 observation points on or above 5(c). To avoid such calculations, and to 
obtain answers in closed analytic form, two assumptions will be made: 

2L + 1 «D' a \ (9.12a) 

and the D 13 observation points will be all on S (c ) and so uniformly distributed that the sum 
(9.9a) can be approximated by an integral when B and B e Then 

F [L) (B) d F [t] (B) =e~\D 13) < B(c f ) • B(c f) > f . (9.12b) 

This result is obtained from (9.9a) by assigning to each r,- e S (c ) a small patch of private terri- 
tory of area \7jtc 2 ID , the patches being chosen to cover 5(c) without overlap. If the r,- are not 
nearly uniformly distributed, the data B(r, ) can be weighted by areas of private patches which 
differ from one r 4 - to another, so as to achieve (9.12b) when (9.12a) holds. This complication will 
not be explored here. 

In particular, if / <L and I <L then (9.12b) applies to B =£/" and B =£/", whose Gauss 
coefficients are given by (9.2a). Lowes (1966) and Backus (1986) evaluate the integral on the 
right of (9.12b) in this case, with the result that 

F { L](«T)DF IL] (Sf) = 5 /f ^ (qD/3y9-\alcf^(l+l)C(l )~ l (9.12c) 

Equations (9.12c) and (9.2b) show that when L satisfies (9.12a) then Z L is an orthonormal 


September 19, 1988 



George E. Backus 


Confidence Set Inference 61 


eigenbasis for F lt j in The corresponding eigenfactor, 0/" = ||/ r (x / m )||/|Jx/ n ]|, does not 

depend on m ; from (9.12c) it is 

<f>, =e~\qDI3)\alc) u \l+\)' A C(l )~' A . (9.12d) 

The cobasis for in Y corresponding to the eigenbasis Z L consists of the vectors 

sr-*r'F(*r>. <9.i2e> 

Therefore, by (D.l le), 

£ *rsr ■ (9.i2o 

/=1 m**—l 

It remains to consider the spaces X( 0ilt )C X which appear in section 7, and to choose n to 
minimize the length of the confidence interval K^ 0jt ^(p) for z defined by (7.9a) and (7.11a). 
Because 0/" is independent of m , all the £/" with the same / will be treated together, so only 
spaces X( Bf0 )=X[/] defined by (9.3b) will be considered. Here 1 </ <L and n =1(1+ 2). The 
models g (0> „) and g (n «,) will be written g {/ ] and g l/ 1 so, by (9.4d), 

Bt/i-^^LCO')" 14 £ (9.13a) 

7=1 «**-./ 

and 

g 1 ' 1 ** 1 * £ COT* t sT*7" (9. 13 b) 

y*=/ + 1 

Therefore, 

llg t/] ll 2 = <7 £ COT 1 £ Is/ 1 ! 2 . (9.130 

In (9.13a,b,c), gj" is defined by (9.4b) rather than as a Gauss coefficient of g. Similarly, 
7 ( 0^1 ) : = g(o^)dF ( o^) will be written as 

7t/] = g[/]°Ff / 1 ) (9.13d) 

so that, by (9.120, (9.13a) and (9.12d), 
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r in =eO/D)» Z (c/ar 2 (J +!)-'* £ g?y? (9. 1 3e) 

>= i m =-y 

and 

llr[/]ll 2 = (30 2 /9) £ (c/fl^O'+D" 1 Z Isfl 2 - (9.131) 

1=1 m=-> 

In (7.11), with n =1(1+2), Kf 0n) (p), T p {n) and K n (p) will be written A" f, } (p), 7 p [/j 
andjq/jtp). Since5 z = {0} in (7.11a), it follows that 

I * f; ] (P) I = T p [ l ] = ||g l/ J || + Hri/jll JC„ ](p) . (9.13g) 

If L is chosen to make the truncation error (9.1 Id) one tenth of the crustal error (9.10c), the p in 
(7. 11c) is given by 

p = (l.l)(u/9)(D/ 3) w . (9.13h) 

Now the values of T p [ l ] for 1 <1 <L must be examined, to see which is smallest. 

Two extreme examples of predictions z will illustrate the foregoing calculations. First, sup- 
pose the prediction functional g :X —>R corresponds to the kernel g(s) defined by (9.5c). In this 
case (9.5j) converts (9.13c,f) to 

iig in n 2 =<7 z o'+tfcy+ir^/cor 1 ( 9 . 14 a) 

j=t+\ 

and 

||y l/] || 2 = (30 2 /D) £ (c/a) 2j+4 (J+\)(lj+\)~ 2 Gp . (9.14b) 

The sum (9.14a) can be computed once for all when l =L, and then only L =27 expressions 
(9.13g), for 1 < / <L, need be computed to minimize the confidence interval for z at failure rate 

P- 

In the other extreme example, z is a single Gauss coefficient of the true core field: 
z =P l”(a ). Then in (9.4b) gf = 8, j 8^ . One must take L > l , and then only the single subspace 
X [ / ] need be considered. Clearly 
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llg [/1 ll =0 (9.15a) 

\\ r[l] U=e(3/D)\l+ir'\c/a) ,+2 . (9.15b) 

Then (9.1 3g) becomes 

l*f/](P)l =(l+lT' A (c/a) U2 [(\.])u +(3 /D)»6v<y u] ,p)) . (9.15c) 

If the crust can be treated as a two-dimensional random process on S ( a ), then the systematic 
error in (9.15c) is only the truncation error, (0.1 )m instead of (l.l)u. The crustal error can be 
included in 5y with other random errors, and 6 in (9.15c) must be replaced by (0 2 +a 2 u 2 )' /l where 
a > 1, and a = 1 only if the crustal statistics introduce no correlations between the <Sy, at different 
measuring points (Langel, Estes, and Sabaka, 1988). In the most favorable case, a = 1, and 
(9.15c) is replaced by 

\Kf n (p)\ =(/+l)-V/a) /+2 [(0.1)M +(3 ID)'\e 2 + u 2 )'\6 {l] ,p )} . (9.15d) 

+++++++++++++++++++Table 2 near here++++++++-H-+++++ 

Table 2 gives \Kf; ] (p) I and |tff/j(p)| in microTeslas. The values quoted in the table are for a 
failure rate p = 10 -4 . Using the failure rate p = 10~ 2 does not change \Kf/ ] (p) | by an amount 
sufficient to affect the table, while that larger failure rate lowers | Kf t j (p ) | by about ten percent. 
For comparison, Table 2 also gives the rms value (ft /"(a ) 2 )'£ of the Gauss coefficients of degree 
l at the CMB, where 

(Pr(*) 2 ) m '•= (2/+1) -1 X griaf. (9.15d) 

m=-l 

The values of (/3/"(a) 2 )^ are from Langel and Estes (1982). In Table 2, u = 12 nT (Cain et al., 
1988), 0 = 6 nT (Langel, et al., 1982), and D =26,500 (Langel and Estes, 1982). The random 
errors are assumed to be Gaussian, so v(y { / j,p) is independent of 7 [; ] and is given by (3.9) or 
Table 1. 

The implications of the computation leading to Table 2 are the following: the heat-flow 
bound permits truncation at L =27 with a truncation error of 1.2 nT in each measured component 
of B. Suppose a failure rate p = 1 0 -4 is acceptable. If the satellite data on a sphere of radius c at 
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altitude c -b =400 km are fitted by least squares with a model (2.1) truncated at l =L, and if 
(9.10c) is accepted as a bound on the crustal field at satellite altitude (or as its standard deviation 
if it can be viewed as random) then each Gauss coefficient /J/"(a) on the CMB has a confidence 
interval j (p) or K^ t j (p) whose half-length is given by Table 2. These half-lengths do not 
depend onm, If the crustal error must be treated as systematic, column 3 of Table 2 is appropri- 
ate, and shows that Gauss coefficients with / >9 are not resolvable at the CMB. If the crustal 
error can be treated as random, with uncorrelated contributions at different measuring points, then 
column 4 of Table 2 is appropriate, and Gauss coefficients at the CMB become unresolvable 
when / > 12. If only Gauss coefficients of degree < / are known, the circle of confusion on the 
CMB has a diameter of about 4(/+l) -1 radians (Booker, 1969). For / =8 this is about 25° and for 
/ =11 it is about 19°. 

Notice that the confidence interval specified by (9.15c) or (9.15d) does not explicitly 
involve the value q =3 x 10 17 nT 2 from (2.4). That value enters only in choosing the truncation 
level (9.1 If). The relationship between q and the minimum acceptable truncation level which 
produces a truncation error no more than ten percent of the crustal error is (9.1 le). If q is 
changed from q x to q 2 ,L must change from L j to L 2 where 

L 2 -Lj = log(? 2 /<7i)[21og(c/a)r 1 • (9.150 

If q is increased by a factor of 3.8, L must be increased by 1. If q is increased by a factor of 
14.2, L must be increased by 2. Clearly the conclusions of CSI in this case are quite insensitive 
to the numerical value of q in (2.5). What is important is that the quadratic form (2(x,x) is 
known, and that the bound q is known to within a few orders of magnitude. 
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10 CONCLUSIONS 

The uniqueness question in geophysical inversion is the question of inference: to predict from D 
observed numerical properties of the earth, yi°K—,yD°\ not only values but error bounds for P 
other numerical properties of the earth, z x ,...,z P . Except in very special cases, infinite dimen- 
sionality of the model space X makes the uniqueness problem insoluble without prior informa- 
tion to supplement the data. Such information is available from laboratory experiments, physical 
theory, and other studies of the earth. Even if the prior information is not certain, accepting it as 
a working hypothesis is a useful way to approach the data. 

One particular kind of prior information is a quadratic bound (1.4a) on the correct earth 
model, usually a bound on energy or dissipation rate. In earlier studies, such prior information 
has been "softened" to a probability distribution p x on the model space X , so that stochastic 
inversion (SI) or Bayesian inference (BI) could be used to incorporate the prior information into 
the inversion. Recent work (Backus, 1988a) has shown that this "softening” process always adds 
spurious new prior "information" not implied by the original inequality (1.4a). Neyman’s (1937) 
theory of confidence sets provides a technique, confidence set inference (CSI), for directly incor- 
porating unsoftened hard prior bounds into the data inversion without generating spurious infor- 
mation. The constant q in (1.4a) is usually known only to within a few orders of magnitude, so 
an essentia] part of CSI is the analysis of the sensitivity of its conclusions to variations in q. 
With CSI the uncertainty in q cannot be dealt with by "softening” (1.4a) to a probability distribu- 
tion, as is done in BI and SI. 

CSI requires that the formal description of the linear inverse problem include the following 
ingredients, all explicitly known to the observer: a linear model space X, a positive-definite qua- 
dratic form <2(x,x) on X, a data space Y =R°, a prediction space Z =R P , a linear data function 
F :X — s- 1 7 , a linear prediction function G :X ->Z, a subset S 5 c Y, a subset S z cZ, and an /72- 

parameter family of probability distributions p(6, •) on Y, where 6 = (6 l 6 m ) is the parameter 

vector. All the p(6,f must have as their domain the cr-ring Z y of Lebesgue measurable subsets 
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of Y . The linear functions F and G must be continuous (and therefore bounded) in the norm 
defined on X by ||x|| = Q (x, x) w . 

To use CSI, the observer needs six more objects: p E , Sy, 0 O , x, H and C- These may be 
unknown, but the observer must know that they exist and that they have the following properties: 
p E is a probability distribution on Y with domain I r ; Sy = (5y Sy D ) is a vector drawn at ran- 
dom from Y according to the probability law p E ; and 


6 0 6 R m , (10.1a) 

x e X , (10.1b) 

tj e S Y , (10.1c) 

C*S Z , (lO.d) 

G(x,x)<l , (10.1c) 

Pe=P@ o.O. 00.11) 

y(°> = f(x)+T7 + 8y, (lO.lg) 

z = G(x) + £. (lO.lh) 


In (lO.lg), y (0) is the observed data vector (y ] (0) , ...,>'# 0) ) s Y ; and in (lO.lh), z is the desired pred- 
iction vector (zj, ...,z P )eZ. The vectors 77 and £ are the systematic errors in modelling the data 
and predictions, and 8y is the random error in the data. The vector x is the ’'corTect" earth model, 
and 0 O is the "correct" parameter vector describing the probability distribution of the random 
error vector 8y in Y . Neither x nor0 o need be unique. 

The final result of CSI is to produce for eachp e (0, 1] a "confidence set” K z (p)^Z such 
that either 

z e K z (p) (10.2) 

or an event E has occurred whose probability is no more thanp. The statement (10.2) is said to 
hold at confidence level 1 -p, or with failure ratep. Section 3 describes a rudimentary calculus 
which keeps track of the failure rates of the statements in any chain of argument whose 
hypotheses have known upper bounds on their failure rates. This calculus of failure rates greatly 
facilitates construction of the confidence sets K z {p ). Unlike BI or SI, CSI gains nothing by per- 
mitting dimZ > 1, so in the present paper Z is always the real line R , G :X — »Z is always a 
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functional# :X —>R, and A' z (p) is a subset of R . Therefore the size | A' z (p)j is easy 10 define, 
and here is taken to be half the diameter of K z (p). Usually K z (p) is an interval, and |A' z (p)| 
is half its length. Many subsets A" 2 (p)c A will make (10.2) correct with failure ratep, and one 
goal of CSI is to find a A z (p) for which | A z (p)| is small enough to make the estimate of the 
prediction z useful. 

Section 5 shows how to construct A z (p) in the simplest special case of (10.1), in which p E 
is known, dim X <dim Y , and F :X —>Y is injective. In this special case, (lO.le) is not needed, 
and the continuity of F and g is an automatic consequence of the fact they are linear and that 
dim X <«. The construction of K z (p) leans heavily on the fact that p E provides a natural dot 
product on T, the only dot product under which the variance tensor of p E is the identity tensor on 
Y . When S y and S z are balls or parallelipipeds centered on the origins in Y and Z, K z (p) is an 
interval whose center, z (0) , is g (x (0) ) where x (0) is the model in X which best fits the observed 
data vector y (0) in the sense of least squares, weighted by the inverse of the variance matrix 
EKSyiXSyj)]. 

When dim Af =«>, (lO.le) definitely is needed, and the dot product x 1 dx 2 =(2(xi,x 2 ) for 
xj,x 2 e X plays an essential role in the argument. Continuity of F and g is not automatic, and 
any Q which makes F or g discontinuous is too weak to resolve the nonuniqueness inherent in 
trying to make infinite-dimensional inferences from finitely many data. When F is discontinu- 
ous, the observer has no option but to seek a stronger Q . When g is discontinuous, a second 
option is available: to replace g by a continuous g A which resembles g closely enough that the 
new prediction, z A , is an acceptable substitute for z . 

When dim A’’ =°° and p E is known, CSI proceeds in two steps. The first (section 6) pro- 
duces a finite-dimensional approximation to (10.1), and the second (section 7) produces an injec- 
tive approximation to the first, thus subsuming the approximate problem under section 5. In the 
first step, the observer chooses any finite N and any N -dimensional subspace X h > of X. Then 
(lO.le) provides an A 7 -dimensional approximation to the original infinite-dimensional problem. 
The new model space is Af (0> . ) =Aj V , the new data function is F (G y^=F | A r (0<W) , and the new 
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prediction functional is g^)-g I X^y The approximation process adds a "truncation error" 
to each of the systematic errors 77 andf in (10.1), and so changes S Y and S z to larger sets S ^ 
and Syv but 0 the wise leaves (10.1) unchanged. One possible choice of X N is N/, the orthog- 
onal complement of the null space of F . This choice has two advantages: its data function F (o r \’) 
is automatically injective, so section 7 can be bypassed; and it may or may not add a truncation 
error to £ in (10. lh), but never adds a truncation error to 17 in (10. lg). Thus it assures 
Sy/ =S r . The disadvantages of the choice X F =N F are that it requires manipulation of D xD 
full matrices (D = 26,500 for many MAGS AT studies) and that the | AT Z (p ) | it produces may be 
unnecessarily large by several orders of magnitude. If the data are relevant to the predictions, 
usually there will be an X N with N «D which produces much tighter error bounds on 2 than 
does N/. 

Section 7 describes the second step in approximating (10.1) by a finite-dimensional prob- 
lem. This step depends on computing the eigenstructure (singular value decomposition) of 
so dot products on both and Y are essential. If0j£d 2 ^ • • • >0 

are the nonzero eigenfactors (singular values) of F (0>N ), and if {£ ]t ...,£ w } is a corresponding 
onhonormal eigenbasis for Y oF^y with cobasis {> 1 , .... Sm }. then for i =1, ...,M , 

. O0.3a) 

while if xe Pi { x i,— then F(x)=0. The tensor F ( 0 A) e Y ®^( 0 A) which 
corresponds to F^y.X (oj\’)-> Y is 

F (0 d°- 3b ) 

*« 1 

Let X (0i „) be the subspace of Y oF ( 0iN ) spanned by {X],...,^ n }, and let F (0(n) :=F | X (0>n ). If 
1 <n <M, then F (0 ^y.X ( 0 n ) -> F (X (0 _„ } ) is an injection, so section 5 produces a confidence set 
K% A) {p). For each n , 

z € Kf 0>n) (fi) (10.3c) 

with failure rate p . Now the observer can choose the n in 1 < n < M which minimizes 
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IKfa,)(p)l. If the data are relevant to the predictions and X N is large enough to give a good 
approximation to the original problem (10.1), then this optimal n will be neither 1 nor M, and 
must be found by numerical computation. In the approximate inverse problem whose data func- 
tion is F( 0tn ):X ( o t „)— the model components Hj ox with j <n are estimated from the data 
vector y (0) , and provide an estimate z^) for the prediction z. The components with j > n are 
not used in estimating z $ n > . All the model components with l<j <N contribute to 
| , the error estimate for z . The contributions of the components with j < n arise from 
the random error 5y and the systematic error rj in the data vector, while the contributions of the 
components with j >n come from the prior quadratic bound, (lO.le). If n is too small, useful 
data are being discarded, and if n is too large, some data are being used whose errors are so large 
that better error control in z is provided by the prior quadratic bound on x. 

If Pe is not known, it can be estimated from the data as long as m +n « D . This problem 
is not examined here for mtl, but the case m = 1 is studied in one common situation: the vari- 
ance matrix E[(8yi)(8yj)] is supposed to be an unknown multiple 0 2 Vj of a known D xD matrix 
Vj. The result, which seems likely to generalize to any m «D - n, is that ifp is not too small 
relative to (D -n )~' A in the sense of (5.8a), then a confidence set K e (p j) for 6 can be found such 
that pj «p and yet K e (p j) is so small that every 6 e AT e (p j) produces nearly the same p (0, •). 
Then for any 6 e Af e (p j), Pe can be assumed equal to p (0, •), and sections 5, 6, and 7 can be 
invoked for this known p E . 

The problem of geomagnetic modelling at the CMB illustrates CSI. The data are measure- 
ments of Cartesian components of the geomagnetic field B at points r, on or above the earth’s 
surface. The model space X is the set of all B outside the CMB which can be produced by elec- 
tric currents inside it. Both the energy bound (2.3) and the heat-flow bound (2.4) make the data 
function F :X ->Y continuous. If the prediction functional g :X ->R is defined by choosing a 
fixed r on the CMB and requiring g (B)=B r (r) for every B e X, then both the energy bound and 
the heat-flow bound make g discontinuous (unbounded). Therefore neither bound enables the 
observer to estimate B r (r) on the CMB from the data. If g (B) is the unweighted average of B r 
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over a disc on the CMB, the heat-flow bound makes g continuous and g(B) observable (i.e., 
estimatable from the surface data), while the energy bound makes g discontinuous and g (B) 
unobservable. If g (B) is the disc average weighted by a tent function, both prior bounds make g 
continuous and g (B) observable. 

Among the many choices for X N which have appeared in the literature of geomagnetic 
modelling, section 9 examines only the oldest, X N =X j^j, the space of models (2.1) whose Gauss 
coefficients /? /"(a) vanish for / >L. Then N =dimA’| L] = L(L+2) and F^j := F (0 ^ >=F| A'^j. 
To avoid numerical computation in finding the eigenstructure of F^j, some simplifying assump- 
tions are accepted. The data consist of all three Cartesian components of B measured by satellite 
at D/3 positions r ( - on a single spherical surface 5(c) of radius c =6771 km. The positions r, are 
nearly uniformly distributed on S (c ), and only truncation levels L are considered for which 
2L+1 « (D/3)' a , so the necessary sums over the r f can be approximated by surface integrals over 
S (c ). The random errors in the D individual component measurements of B are assumed to be 
independently and identically distributed as Gaussians with mean 0 and variance 8 2 . Then an 
orthonoimal eigenbasis for F^j in consists of the magnetic fields with a single non- 
vanishing Gauss coefficient, and the data vectors generated by these fields are the cobasis. The 
eigenfactors for F^j are calculated analytically to yield explicit formulas for \K z (p)\. Table 2 
shows the results when the satellite data are used to predict the Gauss coefficients /3/"(o) at the 
CMB. In that table, MAGSAT parameters are used: D = 26,500; 6 = 6 nT; and a systematic rms 
error u = 12 nT is produced in the data by the crustal field. The possibility is also considered that 
the crust might be amenable to treatment as a two-dimensional random process, so that u 
becomes a standard deviation rather than an error bound. The truncation level L =27 is chosen to 
make the truncation error no more than 1.2 nT. The actual value of q in (2.4) enters only through 
this truncation level. If q were 14 times its value in (2.4), the truncation level would have to be 
L =29. Thus the conclusions are quite insensitive to q. Table 2 is calculated for a failure rate 
p = 10 -4 , but relaxing this to p = lO -2 would not affect the table entries for the systematic crust 
and would decrease those for the random crust by ten percent. An improvement of accuracy by a 
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factor of about 8 is produced if the crustal error can be treated as random rather than systematic. 
The smallest circle of conlusion in observing B r on the CMB has diameter about 26° for a sys- 
tematic crustal error, and about 19° for a random crustal error, corresponding to highest resolv- 
able degrees of / = 1 1 and / = 8 respectively. 
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APPENDIX A. NOTATION FOR SET THEORY 

The notation for set theory will be that advocated by MacLane & Birkhoff (1967) and Birkhoff & 
Bartee (1970). For any objects, «,v ,w, {u.v.w} denotes the set whose members are u,v and 
w, while (u,v,w) is the ordered triple. Thus {u.v.w} = (v,w,«) but («,v,w) *(v,w,m). 
The null set, the set without members, is written 0. If X is a set, U cX means that U is a subset 
of X , i.e., every member of U is a member of X ; and x g X means that * is a member of X . Thus 
u g (w.v.w) and [w,u } £ {u , v,w }, but (m, v) g {u,v,w} and (m,v)£ { u,v,w }. The sym- 
bol "x e X" also abbreviates the noun clause "x which is a member of X." If P [x] is a statement 
about x, such as "x is an odd integer," then {x :P [x]} is the set of all x for which P [x] is true. 
The symbol means "is defined as." If X and Y are sets, then X nY := {u : u e X and u eY }, 
X uf := [u :u g X or u e Y or both), X\ Y := {u :u g X and u e T), and 

X xY := (u :u =(x,y) and x e X and y e Y). This last definition can be abbreviated 
X xY := {(x,y) :x € X and y e Y). The sets X nY, Xuf, X \ T and X xY are called the 
intersection, union, difference and Cartesian product of X and Y. 

Suppose X and T are sets and F is a function which assigns to each x e X a value 
F (x) e T. Then one writes "F :X — >y," usually read as "F maps X into T." The symbol 
"F :X Y " also stands for the noun clause "the function F which maps X into Y." The sets X 
and Y are the domain and codomain of F . Two functions F and G are equal iff (if and only if) 
they have the same domain A' , the same codomain Y , and F (x) = G (x) for every x e X . When 
F can be given by an explicit formula, another notation is available. For example, if R is the set 
of real numbers and F :R —>R is defined by requiring F(u) = u 2 +3u for every u e R, then 
"F :X — can be replaced by "F :m|-»« 2 +3u," or simply by "u|-»« 2 +3m." If P[x] is a 
statement about x , and F :X ->Y, then the set {y : y = F (x ) for at least one x g X which makes 
F[x] true) is abbreviated as {F(x):x g X and P [x]), or simply as (F(x):P[x]}. 

If F :X ->Y and U cX, then F(U) stands for the set (F(«) : u e U}. The setF(X) is the 
"image” of F. If U cX, then F\U denotes the function F : U -*F(U) such that F(u)=F(u) 
for every u g U. The function F| U is called the "restriction of F to U Its image coincides 
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with its codomain. 

If F:X->Y and G :Y ->Z, then the "composite" of G with F is the function 
(GoF):X -4 Z defined by requiring that for each x e X 

(i GoF)(x):=G(F(x )). (A.la) 

If also H : Z ->W, then (A. 1 a) implies 


Ho(GoF) = (HoG)oF . (A.lb) 

The identity mapping on X is the function I x -X such that I x {x)=x for every x e X. 
If F :X — >Y, then clearly IyoF = F = FoI x . 

If F :X Y and F(X) = Y, then F is a "suijection.” If F(x) = F(x) always implies x =X, 
then F is an "injection." If F has both properties, it is a "bijection ofX to Y If F :X Y is a 
bijection, then it has an inverse function, F _1 : Y —*X, defined by requiring that for each y e Y , 
F~ l (y ) is the unique x eX such that F (x) = y . 

A function G :Y ->X is a "left inverse" of F :X —*Y if GoF =I x , and a "right inverse" of 
F if FoG =/y. If F :X -4Z has a left inverse, then F is an injection; and if F has a right 
inverse, then F is a suijection. If F : X -4 Y has both a left inverse G : Y -4 X and a right inverse 
H :Y ->X, then F is a bijection and G =H =F~ l . For example, G =GoIy =Go(FoF -1 ) = 
(GoFjoF -1 = / x oF -1 =F~ l . 


R will always denote the set of real numbers. If a and b e R then a a& is the smaller of a 
and b, while a vb is the larger. The four kinds of intervals are 

[a ,b] := [u : u e R and a £u £b }, fft,b] := {u : a <u <b }, [a,b)‘-= {u :a £u <b ), and 

(a ,£>):= {u :a <u <b}. Suppose Y is a real vector space, AMs a positive integer, and for each 

a t g R and F;:X->Y. Then the linear combination, the function 

(Lit i fl; Fj ) : X Y , is defined by requiring that for each x eX 

r 


N 


N 


£ fl/F,- (x):= £ a i F i( x )- 

l J i=i 


(A.2a) 


If both X and Y are vector spaces, a function F :X — > T is called "linear" if for every positive 
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integer N and any a lf 

r 


N 


N 


a N e R and any x h 


Z °i x i 


1=1 


= Z a i F( x i)- 

i'=i 


Xyv e X 


(A.2b) 


Any linear combination of linear functions F; :X —>Y is itself linear. Since R is a one- 
dimensional real vector space, the definitions and remarks of the foregoing paragraph apply to 
functionals (functions with codomain R ). 

If X, Y and Z are real vector spaces and T :X xY —>Z, then T is "bilinear" if T(x,y) 
depends linearly on each of x and y when the other is fixed. Linear combinations of bilinear 
functions are defined by (A.2a) and are themselves bilinear. 

Suppose U and V are subsets of real vector space X , and A and B are subsets of R . Then 


AU + B V := {au+i>v:a e A, b e B , u e U and v e V } . 


(A3a) 


If c e R and x e U then 


cU:={c)U , (A3b) 

- U := (-1 )U , (A3c) 

x + £7 := {x} + £/ , (A3d) 

Ax:=A{x]. (A3e) 


In particular, o U - {0} and \U - U . Notice the distinction between U - V and U \ V . 

Acr-ring (Halmos, 1950) is a non-empty set Z whose members are subsets of another set X. 
To qualify as a cr-ring, I must have two properties: (1) P and Q e Z implies P \ Q e Z; (2) if all 
sets of the countably infinite sequence Q Q 2 , Q$, • • • are members of Z, then so is their union, 

•" (abbreviated as u (= j Q ( - ). 

A measure fi on a set X is a function whose domain is a cr-ring Z of subsets of X , whose 
codomain is R + , the set of non-negative real numbers together with +®°, and which satisfies 


^(^,“iC,) = Z^(2i) 

i=i 


(A.4a) 


whenever Q t e Z and Q , nQj = 0 for all i and all j*i. Property (A.4a) is called "countable 
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additivity,” and the members of I are then called the /t -measurable subsets of X (Halmos, 1950). 
If the whole setX is /x -measurable and if 

H(X)= 1 (A.4b) 

then/x is called a "probability measure" or a "probability distribution" on X . The /x -measurable 
subsets P and Q of X are then called "events." P is called "the event P ," "the event that x eP ," 
or "the event that P happens." Thus/Xx(? uQ)is "the probability that at least one ofP and Q 
happens," while/x* (P nQ) is "the probability that both? and Q happen," and fL x (P \j2) is "the 
probability that P happens and Q fails." Clearly 

Id x (P u Q ) ^Hx (P ) (2 ) (A.4c) 

and 

Hx (P n Q ) ^Px ( p ) x (2 ) • (A.4d) 

If fix an d/Xy are probability measures on sets X and Y, their a-rings being X, x and Zy , then 
the product measure on X xY is defined as follows. Itsa-ring is the intersection of all the 
a-rings of subsets of X x Y which include as members all the sets P x Q with P e X. x and 
Q e Zy . For such sets, 

lix*y(P *Q) : =ldx( p )l l Y(.Q) (A.5) 

and (A.4a) permits the calculation of/z*xy(W) for any other set W e (Halmos, 1950). Set- 
ting P =X and Q -Y in (A.5) shows that ii Xx y is a probability measure onXxf. In this con- 
text, if P e X-x and Q e Zy , then the event P x Y is called the event P , and the event X x Q is 
called the event Q , so (A.4c,d) remain true. The probability that at least one of events P and Q 
happens is no more than the sum of their probabilities, and the probability that both P and Q 
happen is no more than the smaller of their probabilities. 
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APPENDIX B. REAL HILBERT SPACES FOR MODELS AND DATA 

In linear inversion, the model space X is a real linear space (vector space; Halmos, 1958), and its 
dimension, dim X , is infinite. A quadratic inequality like (2.3) or (2.4) induces on X a natural 
dot product, which makes X a pre-Hilbert space. Then (Halmos, 1951) X can be completed to a 
Hilbert space. On the data space T, the probability distribution for the random error vector 8y 
induces a dot product. Since D =dim y <«>, Y is automatically complete, and so a finite- 
dimensional Hilbert space. These natural dot products on X and Y will be constructed in this 
appendix. 

On X , a quadratic inequality like (2.3) or (2.4) produces a symmetric, positive-definite bil- 
inear functional Q . This Q assigns to any pair (x,x) of models in X a real number Q (x, X) such 
that 


<2(x,X) = 2(S.x); CB.la) 

(2(x,x)>0 (B.lb) 


if x*0; and for any positive integer#, real numbers a x , ...,%, and models x 0 ,xj, ....x^, 

N N 

Q ( x o> x <3/ X,)= X a i Q ( x o> X; ) • ( B - 1 C) 

i=t i=i 

Q is "bilinear" because (B.la) and (B.lc) imply that Q (x, x) is also linear in x when x is fixed. 
The observer’s prior information is the knowledge of a real number q such that the correct earth 
model x must satisfy Q (x, x)<q, or 

<7 -1 !2(x,x) < 1 . (B.2) 


For example, in (2.3) if/? /"(a) and /3/"(a) are the Gauss coefficients of the magnetic fields B and 
B at the CMB, then 


<7 -1 f2 (B, B) = (2 xlO 33 ) -1 


/ 


L E 

/ =1 m =-*/ 


2/+1 

/+1 


/?/"(a)& m (a). 


On X , define the inner product or dot product x px by 
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xqx := <7 _1 j2(x,x) (B.3a) 

and the length ||x|| by 

tlx || := (xdx) 1/4 (B.3b) 

The notation v v will be reserved for use with the ordinary real three-dimensional vectors. On 
being completed in the norm ||x ||, X becomes a Hilbert space. The investigator’s prior quadratic 
information (B.2) can now be written 

11*11*1. (B.4) 

In geomagnetism, the simplicity of the recursion relations for complex spherical harmonics 
sometimes makes it useful to consider a complex model space X *, one whose scalars are com- 
plex. A real Hilbert space X can always be extended to a complex Hilbert space X*. The 
members ofX* are the formal symbols x + iX where x and X e X and i =(-l) 1/ \ Linear combina- 
tions with complex scalars are defined in the obvious way using t 2 =-l, and the complex conju- 
gate of x+i'X is (x + /X) c = x-i'X. If \’i, x 2 , Xj, XjsX, the dot product of wjsxj+iXj and 
w 2 = x 2 + i X 2 is defined as wj d w 2 := Xj □ x 2 - Xj □ X 2 + i [Xj o x 2 + x 3 □ X 2 ] and the inner product 
of Wj and w 2 is (wj | wf) := w 1 c dw 2 . Complexifying X in this way introduces a distinction 
between inner and dot products which engenders complications in the notation, so only vector 
spaces with real scalars will be admitted here as model or data spaces. 

On the data space Y , the inner product comes via integration from p E , the probability distri- 
bution of the random error vector 5y. As usual, integrals of real-valued functions of Sy with 
respect to p E will be written as expected values. Thus 

EtfyA^ldpzWy, (B.5a) 

Y 

E [ := J dp E (y)y- t yj . (B.5b) 

r 

If p E is known and £[<5y, ]^0, then its known value can be subtracted from both sides of (1.1). 
This redefines v (0) and Sy in such a way that 

£[5y,] = 0. (B.6a) 
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If p E is being estimated from the data, £[<5)’, ] will be estimated. Alternatively, p E can be 
parametrized so that E [<5y, ] = 0, and the necessary correction can be added to 77 ,- or £,(x) and 
subtracted from 8y\ in (1.1a). This slightly increases the bound on rj or the amount of informa- 
tion carried by the models in X, and again achieves (B.6a). Henceforth, (B.6a) will be accepted. 
Then the D xD variance matrix V of 8y has entries 

Vij =£[(5y,)(5y;)] . (B.6b) 

This variance matrix is positive semi-definite. If it is not positive-definite, then, with probability 
1, 8y lies in a subspace U of Y (Cramer, 1946). Then there are only dim U linearly independent 

linear combinations of 8y i ,...,8y D . The corresponding linear combinations of y ] y D can be 

taken as the new data, and U is the new data space. The remaining D -dim U linear combina- 
tions of y }p have no random error. In the real world, this must mean that their values are 

obtained from theory rather than measurement. They are not really data, and can be discarded. 
(Usually, they will all vanish.) Henceforth, the matrix V of (B.6b) will be supposed positive- 
definite. 

Now let TV := V -1 . The D xD matrix IV is the "weight matrix" generated by the random 
errors. It is symmetric and positive-definite. On the data space Y define the dot product of 
y = 0’i Yd) and y = (y i,...,y D ) as 

D 

yoy:= £ yiWijfj. (B.7) 

ij = 1 

This dot product measures the data in units of their random errors, so it is a dimensionless pure 
number. It will now be shown to have the property that for any fixed y and y in Y 

£[(yD<5y)((5yciy)] = yoy- (B.8) 

Conversely, no other dot product on Y has the property (B.8). That (B.7) does imply (B.8) fol- 
lows immediately from (B.6b) and the definition of W. For the converse, suppose that yo y is 
any dot product on Y. There will be a symmetric, positive-definite D xD matrix U such that 

yoy= X Yi u ijSj ( B -9) 

‘.7-1 
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(Halmos, 1958). If the dot product (B.9) has property (B.8), then 

D D 

X yi Uij Vj k Uuy,= £ >>,- U a Si ■ 

i, >,*,/ = ! i,/= 1 


Since y and y are arbitrary, it follows that UVU = U . Since U is positive-definite, U~ l exists, so 
UV =/ , where / is the D xD identity matrix. Hence U = V -1 = W, so (B.9) and (B.7) agree, 
and yoy = yoy. 
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APPENDIX C. TENSORS AND LINEAR MAPPINGS ON HILBERT SPACE 

This appendix is a self-contained list of the facts and sometimes slightly unconventional Hilbert 
space notation used in the present paper. Proofs of all assertions can be found in Halmos (1951), 
Dunford and Schwartz (1958), or Lorch (1962). Some of the shorter proofs are given here to aid 
the exposition. 

A linear mapping F :X — * Y from Hilbert space X to Hilbert space Y is "bounded" if there 
is a real number K such that for every xeX, 

||F(x)|| <AT ||x|| . (C.la) 

Halmos (1951) shows that a linear mapping of one Hilbert space into another is bounded iff (if 
and only if) it is norm-continuous (i.e., lim ||x„ -x||=0 implies lim ||F(x B )-.F(x)|] =0). The 

n — n — »«> 

smallest K which works in (C.la) for all x e X is called the norm of F, written ||F||. Then the 
best possible version of (C.la) is 

||F(x)||<||F||||x||. (C.lb) 

A bilinear functional T :X xY -*R is "bounded" if there is a real number K such that for 
each (x, y)eXxf 

|T(x,y)| <K ||x|| ||y|| . (C.lc) 

A bilinear T is norm-continuous iff it is bounded. The smallest K which works in (C.lc) for all 
(x,y)e X xY is called the norm of T and written ||T||. IfX and Y are Hilbert spaces whose dot 
products are written with and if T:X xP is a bounded bilinear functional, T(x,y) will 
often be written as x □Toy. Thus 

xoToy := T(x,y) , (C.ld) 

and the best possible version of (C.lc) is 

| x □ T o y | <11*11 ||T||||y||. (C.le) 

The real number xoTny depends linearly on each of x, T and y if the other two are fixed. 
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If X and Y are Hilbert spaces, the set of all bounded linear mappings F :X ->Y will be 
denoted by BL(A" —> Y), and the set of all bounded bilinear functionals T :X x F — >R will be 
denoted by X ® Y . The members of A' ® Y are "tensors over X x Y If linear combinations are 
defined by (A.2a), both BL(A' — » Y ) are X ® Y are real vector spaces. In fact, if N is any positive 
integer, and F, e BL(A' — » Y), T ,• e X ® F , and a,- e R for each i e { 1, ...,N } , then 


||£ fl( -F ( ||<£ \a : \ ||F,|| 
1=1 1=1 

II 2 a i Till - 2 Kl IIT.-II . 
1=1 1=1 


Moreover, if Z is also a real Hilbert space, and G e BUY — >Z), then 


(C.2a) 

(C.2b) 


||GoF||<||G|| ||F|| 


(C.2c) 


Inequalities (C.2b,c) follow immediately from the definitions of the norms, while (C.2a) is a 
consequence of two facts about any Hilbert space Y : if a e R and y, y eY then 


Ml = \ a\ ||y|| 


and 


lly+yll ^llyll + llyll • 


(C.3a) 


(C.3b) 


The triangle inequality, (C.3b), is itself an immediate consequence of the Schwarz inequality, 
|yay | < |}y|| ||y|| , (C.3c) 

which in turn follows from the fact that y' Dy' £ 0 for all y' e Y (Halmos, 1951). 

When X and Y are Hilbert spaces, every function in BL(A' -+Y) can be thought of as a ten- 
sor in Y ® X and vice-versa. This correspondence is useful both to simplify notation and because 
it permits any operation applicable to either a tensor or a linear mapping to be transferred 
immediately to the other. To establish the correspondence, let F e BL(Af — »T), and define 
T/r : Y xX — » R by requiring that for each (y, x) e Y xX 


Tf (y,x) = y oF(x) 


(C.4a) 
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Clearly T f is a bilinear functional on YxX. Moreover, by (C.lb) and (C.3 c), 
I Tp (y, X) | < llyll ||f || II x|| , so TV e Y ® X and || T F || < |JF || . In fact, 

I|T f || = ||F||. (C.4b) 

To see this, let Xi,x 2 , ••• be an infinite sequence of nonzero vectors in X such that 
lira ||F(x n )||/||xJ=||F||. Let y„ :=F(x n ). Then T F (y„,x„) = ||yJ l|F(xJ|| so 

lim |T f (y n ,xJ|/|ixJ||yJ=||F||. 

n —+**> 

Therefore ||T f || >||F||. Since also ||T/r|| S||F||, (C.4b) is established. 

Equation (C.4a) is a rule which assigns to each bounded linear function F e BL(X — » Y) a 
tensor T F e Y ®X. That is, (C.4a) defines a function BL(X — > T) -» Y ®X such that for each 

F e BL(X ->Y), 

1(F) = T f . (C.4c) 

It follows immediately from (A.2a) that 5 is linear, and (C.4b) shows that If preserves norms. 

To show that if is a bijection from BL(X xT) to Y ®X, an inverse for it will be con- 
structed. First, suppose that / e BL(F ->F). Then (Halmos, 1951) there is a unique vector 
y f e Y such that for every ye Y 

f (y) = y □ >’/ • (C.5a) 

Now suppose Te Y ®X. For any fixed xeX, consider the functional f x :Y-^R defined by 
requiring that for each ye 7 

/ x(y) = T(y,x) . (C.5b) 

Denote this functional f x by T(-,x). Clearly f x e BL(F ->R), and in fact ||/ x || <||T|| ||x||. 
Therefore (C.5a) applies to/ =f x . In Y there is a unique vector y ft , or y T( . iX) , such that for every 
ye Y, / x (y)=yoy T (-,x). That is 

T(y, x) = y Dy x (-,x) . 
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Now define F T : X — > Y by requiring that for each xe X 

Fy(x) = y T (‘. *) • (C.5c) 

Then F T is uniquely determined by the fact that for each xeX and ye f, 

T(y,x) = yD F-rCx). (C.5d) 

Finally, define f : Y ®X — > BL(A' — > Y) by requiring that for each Te Y ®X , 
f(J ) = F r . (C.5e) 

Comparing (C.4a) with (C.5d) shows that F -F r iff (if and only if) T = T f . That is, 
$:Y ®X — >BL(X — >T) is both a left and a right inverse for J:BL(X ->Y)—*Y ®X. There- 
fore.^and ,7are both bijections, and each is the inverse of the other. Since.7 is linear, its inverse 
must be linear. The linearity of $ can also be inferred directly from the uniqueness of the F r 
defined by (C.5d). Of course, (C.4b) now implies 

II^tIMITII. 

Henceforth it will be convenient to confuse the bounded linear mapping F : X — » Y with the 
tensor T F e Y ®X, and to write T F as F. Then, in the notation of (C.ld), 

ynFnx = vnF(x) (C.6a) 

for every xe X and y e Y. If either F or F is given (C.6a) uniquely determines the other. Furth- 
ermore, (C.4b) can be written 

l|F|| = ||F||. (C.6b) 

Equation (C.6a) suggests writing the vector F (x) as F dx, so 

F dx := F(x) . (C.6c) 

Then (C.6a) becomes 

yoFDX = yD(FDx). (C.6d) 

The vector F dx depends linearly on each of F and x when the other is fixed, and moreover (C.lb) 
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takes the Schwarz-like form 

||Fox||<||F||||x||. (C.6e) 

A special case of (C.6) is that in which Y =R. Then F is a linear functional / : X — » R , F 
is the corresponding vector xy e X , and (C.6a) reduces to (C.5a). 

A first application of the correspondence (C.6a) between tensors and bounded linear map- 
pings will be to define the transpose of each. For Fe Y ®X, the definition of the transpose F r is 


simple: forever)' (j’.xJsfxX, 

F T (x, y) = F(y, x) . (C.7a) 

Evidently F r e X ®T, 

l|F|| = ||F r || (C.7b) 

and 

(F 7 ) 7 =F. (C.7c) 

Moreover, if F, G e Y ® X and a , b e R , then clearly 

(aF + bGf =aF T +bG T . (C.7d) 


Now' suppose F e BL(X — » Y). Then F T , the transpose of F , can be defined as the mapping in 
BL(T — >X) corresponding to the tensor F r e X 0 Y . Thus F T :Y ->X is the unique function 
with domain Y and codomain X such that for every (x, y)e X xY, 

yDF(x) = xnF T (y). (C.7e) 

Without a discussion equivalent to the introduction of tensors, it is not clear that a function 
F t :Y with property (C.7e) exists at all. The tensor argument shows that a unique F T 
exists, is linear and bounded, and satisfies (C.7b,c,d). 

It will be useful to write F r (y) as ynF. Thus, if y e Y and F e BL(X — > }'), then invoking 
(3.9c) gives 

y D F:=F T (y) = F T Dy (C.8a) 
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The "dot product" y qF is bilinear in y and F, and (C.6e) and (C.7b) imply 

lly dF|| < ||y|| ||F|| . (C.8b) 

Equation (C.7e) can now be rewritten as yn(F Qx) = XD(ynF), which is (vciF)qx. Thus, (C.6d) 
and (C.7e) now yield 

y n(F dx) = (yoF) ax = y pF □x = XDF r ciy. 

The correspondence (C.6a) also provides a simple way to define symmetric and positive 
definite linear mappings. A mapping F e BL(X -»X) is symmetric if F T =F, and positive 
definite if xdFdx>0 whenever x e X and x*0. 

Composition is an operation more easily defined for functions and then transferred to ten- 
sors. Suppose that X, Y, and Z are Hilbert spaces and Ge Z ®T, and Fe Y ®X, ThenGoFis 
defined as the member of Z ®X such that for each xeX 

(GdF) dx := G p(Fdx) . (C.9a) 

If W is another Hilbert space and H e W ® Z, then (A. lb) implies 

Hp(GdF) = (HoG)dF. (C.9b) 

The identity 

(GdF) 7 '=F t qG 7 ‘ (C.9c) 

and the corresponding result for G and F come immediately from the following tedious but sim- 
ple application of the definitions and the various associative laws: for every xeX and z g Z , 
x □ (G □ F)^ o z = zo(GoF)ox = zd[(GdF)dx] = zd[Go(Fdx)] = (zdG)d(Fdx) = 
(Fdx)d(zdG) = (xoF r )D(G 7 az) = xo[F r d(G 7 dz)] = x □ [(F 7 oG r )nz] = 
x □ (F 7 pG 7 ) □ z. Then (C.9c) and (C.8a) imply that for every z e Z 

(zdG)dF = zd(GoF). (C.9d) 

The identity tensor on X is defined as the tensor I* e X ®X which corresponds via (C.6a) 
to the identity mapping I x e BL(X — >X). To any (x,x)e X xX, I x assigns the value 
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I* (x, x) = x □ I* □ x = x □ x . (C. 1 Oa) 

Then evidently 

i$=i x . (C.iob) 

If F e BL(X — » T) has an inverse, F~ x :Y -*X, then so does F r e BL(Y — >X r ), and 
(F T )~ l =(F~*) T . (C.lOc) 

The proof is this. Since X and Y are complete, the Banach inverse theorem (Luenberger, 1969, p. 
149) says that F _1 g BL(T -»X). The tensor in X ®F corresponding to F~ l is written, of 
course, as F" 1 . Now (F~ l ) T g BL(X -> K) does exist, and (F~ l ) T oF T =(F of' 1 ) 7 = /f = I Y , 
while F t o (F~ ] ) t = (F~'oF) T = = / x . Thus (F -1 ) r is a right and a left inverse for F r . 

Hence ( F T )~ l exists, and satisfies (C.lOc) 

Another application of the correspondence (C.6a) is in the construction of dyads, which are 
very easy to define as tensors. Let X and Y be real Hilbert spaces and let xg X and y e Y. 
Define the "dyad" xy g X ® Y by requiring that for every x g X and ye Y 

XD(xy)oy = (xox)(yciy) . (C.lla) 

The dyad xy is often written x ® y. Evidently 

(xy) 7 =yx. (C.llb) 

For any a g R and x g X , ax will also be written xa . With this convention, the mapping in 
BL(T —>X) which corresponds to the tensor xy is determined by the fact that for each ye 7 

(xy) □ y = x(v ay),. (C. 1 1 c) 

and its transpose is determined by the fact that for each xeX 

x □ (xy) = (x □ x)y . (C. 1 1 d) 

There is still another pair of associative laws for dyads. Suppose X , Y , Z are real Hilbert spaces, 
and Fg Z <8>X, and Gel ®Z, and xg X, and y g Y. Then for any yc Y, [Fo(xy)]oy = 
Fo[(xy)oy] = Fci[x(yoy)] = (F Dx)(yDy) = [(F Di)y]ny. Therefore 
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F n(xy) = (FDx)y . (C. lie) 

Similarly 

(xy)oG = X(yoG). (C.llf) 

It will be necessary to define the integral of a function whose codomain is a set of tensors, 
because variance tensors of probability distributions are such objects. Suppose that Cl is an arbi- 
trary set and p is a probability distribution (measure) on Cl (see appendix A and Halmos, 1950). 
For a large class of functionals / :C1->R , called the p -integrable functionals, the integral 

J dp (co)f (co) is defined, and is usually written as an expected value, 
a 

E p \f (a>)] := J dp (co)f (co) . (C.12a) 

n 

Now suppose that X and Y are Hilbert spaces and T : Cl X ® Y . For each fixed (x, y) e X ® Y , 
<01->xDT(tu)ny defines a functional on Cl. Suppose that for each (x,y) e X xY this functional 
is p -integrable. Then it is possible to define the integral of T(<u) with respect to p , written either 

J dp (co)T(co) or E p [T(g»] . 
a 

By definition, E p [T(co)]:X xf ->/!, and this functional is defined by requiring that for each 
(x,y)e XxY 

E p [T (co)](x, y) := E p [x □ T(co)ay] . (C.12b) 

Clearly E p [T(©)] is bilinear in x and y. If it is also bounded, then E p [T(co)] e X ® Y, and the 
function T.:C2-*X ® Y is called p -integrable. For example, if the functional co (-» || T(ty)ll is p - 
integrable, so is T. If T is p -integrable, then the definition (C.12b) of its integral can be rewritten 

x d E p [T(£o)J d y := E p [x □T(&))ay] . (C. 1 3a) 

One consequence of (C.13a) is that if W and Z are also Hilbert spaces, and Pe W ®X and 
Q € Y ® Z, then P dT and T oQ are integrable, and 

Po£,[T(ft))]=£,[PnT( ffl )] (C.13b) 
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and 

E p [T(©)] □ Q = E p [T(<y) □ Q] . (C. 1 3c) 

In the special case of (C.12) which arises in CSI, O. is the data space Y , p is the probability 
distribution p E of the random error vector 8y, and X is either R or Y. Following the convention 
introduced in (B.5), £ P£ [T(y)] will be written £[T(5y)]. Then £[<5y) is defined as a member of 
R ® Y, i.e., of T, and clearly (B.6a) implies 

£[£y] = 0. (C.14a) 

The variance tensor of 8y is defined as 

V:=£[(<5y)(<5y)), (C.14b) 

because of (C.14a). From (C.13a), if y and ye Y then 
y □ Vci5' = £[(yD$y)(5yoy)] . 

Therefore, from (B.8), 

yoVDj^yoy- 

Then, from (C.lOa), 

V = Iy. (C.14c) 

Thus the dot product (B.7) is the only dot product on Y which makes V, the variance tensor of 8y, 
equal to the identity tensor on Y. If p E is Gaussian, then adopting on Y the dot product (B.7) 
gives p E the density function 

dp E (y) = ( 2 n)~ D/2 txp(-V 2 ||y|| 2 )dP y (C.15a) 

where do y is the volume element in Y defined by the dot product (B.7). If some other dot pro- 
duct, yo 5’, is adopted for Y, then V is defined by (C.14b) and V * Iy . A Gaussian p E will have a 
density' function given by 

dp E (y) = (2^) _£>/2 (det V) -Vi exp(- J /iy o V -1 o y) do y (C.15b) 

where dPy is the volume element in Y defined by the dot product o. 
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APPENDIX D. THE EIGENSTRUCTURE OF A BOUNDED LINEAR FUNCTION WITH 
FINITE DIMENSIONAL CODOMAIN 

When the model space X and the data space Y are Hilbert spaces, and the data function 
F :X — >Y is linear, section 7 describes the calculations needed to optimize resolution in 
confidence set inference. Those calculations depend on the eigenstructure of F (its singular value 
decomposition). That eigenstructure is determined by the facts that 

dim Y < «> (D.la) 

and 

||F||<oc. (D.lb) 

The algebraic discussion in Golub and Van Loan (1983, p. 16) suffices for computations, but the 
geometrical discussion given here may be a useful supplement for readers who, like the author, 
find it easier to think geometrically. 

The geometrical discussion is based on orthogonality. If X is a real Hilbert space and 
x,xeX, and if xqx=0, then one writes "x±x", read "x is orthogonal to x." A set 
{£j,$ 2 i • • • ) <^X is ''orthonorm al" if 

(D.2) 

(5ij = 1 i fi=j and <5,y =0 if / * j). Every orthonormal set is linearly independent. 

If xe X, U cX and V qX, then xlU means that xlu for every ue U , and U ±V 
means that u ± v for every u e U and v e V . Define 

x □ U := {xdu : ue U } 

and 

U dV := {udv:ue 17 and veV}. 

Then x±U iff xnU = (0), and U 1 V iff(/oV = {0}. 
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If x e X and also x±A' then xnx = 0, so x = 0. The most useful corollary of this observa- 
tion is that if Xj and x 2 e X and Xj ox^x^dx for all x e X , then (xj -x 2 ) -LA' so Xj =x 2 . 

If U cA' and V cX and U IV 7 , then U +V is written U © V 7 . Every vector in U © V can 
be written in exactly one way as u + v with u e U and v € V 7 . To see this, suppose u, u' e U and 
v, v' e V . Then (u - u') □ (v 7 - v) = 0. If also u + v = u' + v', then u - u' = v' - v = 0. 

If U cAf, then 

t/ x := {x:xe X and xlC/}. 

The set U 1 is the "orthogonal complement" of U . It is always a subspace of X , even if U is not, 
and it is closed in the sense that if X),x 2 , • • • e U 1 and lim ||x„ -x|| =0 then x€ U x (Halmos, 

n -♦« 

1951). Clearly C7 1 1/ 1 so 

u c (t/- 1 ) 1 . 

If U is itself a subspace of X , then (Halmos, 1951) 

U e = (L^) 1 (D.3a) 

where U e is the closure of U . If U is a closed subspace of A!' , then U c =U so 
U = (U 1 ) 1 ; (D.3b) 

in this case 

X = U © U 1 . (D.3 c) 

In particular, (D.3c) holds if dim £/<«>, because every finite-dimensional subspace of X is 
closed. Equation (D.3c) means that for each xeX there are unique vectors x v and Xy such that 
x v s U , xfie V 1 , and 

X = Xy +Xjj . (D.3d) 

Of course it is also true that 

x v o xjj = 0 . (D.3e) 
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If U is a closed subspace of X, (D.3d) permits the introduction of functions Py :X —>X 
and Qu :X — »X defined by requiring that for each x € X 

Py(x) = xy ( D - 4a ) 

Qu(*) = *u- < D - 4b ) 

The function Py is called the orthogonal projector of X onto U. It is a surjection onto U, and 
p u | u =j v . Qearly Qy is Py. From the uniqueness of x v in (D.3c) it follows that Py is 
linear. From (D.3d) it follows that 

l|x|| 2 = ||x l/ || 2 + ||x^|| 2 , (D.5) 

so Hxyll <||x||, and hence HFf/ll - 1- Evidently 

l|Ft/|| = 1 if f7*{0), (D.6a) 

and 

\\Qu\\ = 1 if U*X. (D.6b) 

Also, from the definitions, 

Py + Qu —fx (D6c) 

PyOPy=Py (D-6d) 

Qu ° Qu — Qu (D.6e) 

PyoQu=QuoPu=0. (D.6f) 

Moreover, if x and X€ X, then xoPy(x) = xoxy = x v dx^ = x v dx = Py(x)nx = xoPy(x), so 
Py = Py ■ (D-6g) 

Suppose U is a finite dimensional subspace of X. Let dim U =N <«>, and let {fy, } 
be any orthonoimal basis for U (many can always be found; Halmos, 1958). Then 

P t/ = £ *; x,- • (D-6h) 

i=l 

To prove this, let x e X . Then x v e U,so 

N 

X U = 'Z °i *< 

i=l 
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where 

Cl; = £; □ X[/ . 

But it; € U SO £, dXjjT = 0. Thus 
O; = D X . 

Therefore 

N N N N 

T> v ax = xu = £ = £ */(*,■ dx)= E K*« */)ox] = ( E *« *«) ° x • 

1=1 1=1 1=1 1=1 

This proves (D.5h). 

The foregoing properties of orthogonality can now be used to obtain the eigenstructure of 
F :X — » T when X and Y are Hilbert spaces, F is linear and bounded, and (D.l) holds. First con- 
sider the null space of F , defined as 

N F := (x:xeX and F (x) = 0} . (D.7a) 

Since F is linear, Np is a subspace of Since F is continuous, Np is closed. Therefore 


X = N f © N/ . (D.7b) 

In the spirit of the conventions (C.6c) and (C.8a), define 

FdI := (FDx:xeA) (D.8a) 

and 

Y dF:= {ycFtyeT} . (D.8b) 

Then, clearly, 

FdX =F(X)qY (D.8c) 

and 

Y oF = F t (Y)zX . (D.8d) 


Since Y oF is the image of the linear function F T : Y —>X whose domain is finite-dimensional, 
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dim (Y qF) < °° (Halmos, 1958). Define 

M := dim (Y dF) . (D.8e) 

The identity 

(Y oF) ox = Y □ (F nx) 

holds for every xe X. Therefore x ± (Y oF) iff Fnx = 0, i.e., iff xe N F . It follows that 
N f =(TdF) x , (D.8f) 

and. by (D.2a) 

N/ = (TdF) c . 

But, being finite-dimensional, Y nF is closed, so ( Y □F) c =Y qF, and hence 
F dF = N f . (D.8g) 

Then (D.7b) can be written 

X=N f ©(FdF). (D.8h) 

Equation (D.8h) resolves X into the two pieces N F and Y dF. The function F | N F is 
trivial, simply 0, so to study F it suffices to study the function 

F := F | Y □ F . (D.9a) 

The function F has two important properties. Firstly, 

F(TdF)=F(X). (D.9b) 

To see this, take xeX and, using (D.7h), write 

x = x YoF + x r oF • 

Since Xy oF e N F , therefore F(xy- oF ) = 0, so F(x) = F(x Y oF )=F(x F oF ). This proves (D.9b). The 
second important property of 

F:YoF->F(X) (D.9c) 
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is that it is an injection. For suppose x e Y □ F and F (x) = 0. Then F (x) = 0, so x € tip. Then by 
(D.8f), xe (Y dF) x as well as Y oF, so x = 0. Since F is linear, it is injective. But this fact and 
(D.9b) show that F is a bijection, and thus has an inverse, 

F~ l :F(X)->Y dF. (D.9d) 

Now consider the function F T o F : Y oF-» Y dF. This function is symmetric, for 
(F t oF) t =F t o (F t ) T =F t o F . It is also positive definite, for if xe Y oF and x*0 then 
F(x)* 0, so 0<||Fox|| 2 = (Fox)d(Fdx) = (xdF 7 )d(Fdx) = x a (F T □ F) □ x. The domain 
Y dF of the symmetric, positive definite function F T oF has finite dimension M. Therefore, 
counting repetitions of multiple eigenvalues, F T oF has M positive real eigenvalues (Halmos, 
1958). These can be ordered as 

■ • • >4>h >0. (D.lOa) 

Furthermore, FdF has an orthonormal basis {£, } consisting of eigenvectors of F r oF 

with the eigenvalues (D.lOa). That is 

F T oF(f ; )=f ? ii (D.lOb) 

for i = 1, ...,M. This last equation can also be written 
F r oF □ i,- =<p?Xi . 

Then, by (D.2), 

Zj □ F r d F □ x, =<t>?8 tJ . 

Therefore, by (C.8a), 

(Fofy)Q(fD*i)=*, ? $v . (D.lOc) 

Next, define 

5’,- :=<t>r ] F(£i) . (D.lla) 

By (D.lOc), (5’i» ) is an orthonormal subset of Y dF. Since dim Y oF=M, {$’i,...,5V) is 

an orthonormal basis for Y dF. Moreoever, from the definition (D.l la), 
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for i = 1 , A/ . An immediate consequence of (D.l lb) is that in Y ® (F dF) 

F =Z<t>i$i*i • 

1=1 

To see this, note that 

M 

ly dF = Z 
1=1 


(D.l lb) 


(D.l lc) 


(D.l Id) 


and that 

m m . 

F = FDly DF = £ (F □ *,-)*; = L Ftf,)*; . 

i=i «-l 

Now (D.l lc) follows from (D.l lb). A useful consequence of (D.l lc) is that 

i=l 

To prove (D.lle), denote its right-hand side by G, and observe from (D.l Id) and (D.llc) that 
G □ F = F oG=Iy oF . 


Finally, from (D.8h) and (D.l lc) it will be shown that 

F =£0, 

i=l 


(D.l 2a) 


where now the right side of (D.12a) is interpreted as a tensor in Y ®X. Equation (D.12a) exhi- 
bits the eigenstructure of F, its singular value decomposition. The positive real numbers 
<p j, are the eigenfactors or singular values of F , {xj, .... } is an eigenbasis for F , and 
(5’j, ..., } is the corresponding cobasis. 

To prove (D.12a), let G denote its right-hand side. It will suffice to show that F □x = Gnx 
for ever}' x e X . But if x e X then 


m m 

F D X = F (X) = F (Xy oF ) = F (Xy oK ) = F D Xy oF = ( £ $ / Si */ ) □ Xy oF = X/-1 0 < Si (*« ° *Y oF> 

i=1 
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Since £,■ g T dF, x,- dx F dF = 0, so £,■ dx f oF = ox. Therefore 
Fdx= Si (*« □*) = ( 2 <P, Si «;)dx . 

i=i <=i 

This completes the proof of (D.12a). 

If N f = {0), so y dF=X, then F =F, and (D.l le) will hold forF as well as F. In general, 
N f * (0), and then F _1 does not exist. When N F * {0}, F is said to have the eigenfactor or 
singular value 0 as well as the positive eigenfactors 0 j >0 2 - ' ' ' > 0. Even when N F * (0), 

F T :Y —>X exists, and it follows immediately from (D.12a), (C.7d) and (C.l lb) that 

r T = Z<Pi*iSi • (D.i2b) 

i=i 

Thus F and F T have the same eigenfactors, and the eigenbasis for either is the cobasis for the 
other. 

The foregoing discussion can be carried out verbatim when D =dim Y =<*= as long as 
F :X Y is compact (Kato, 1976). The number M of nonzero eigenfactors will always obey 

M Si (dim X) a (dim Y) . (D.l 3) 

This point is of only academic interest in practical problems, because they never supply infinitely 
many data. 

The eigenstructure theorem (D.12a) can be applied either to the whole data function 
F :X ->Y on the infinite-dimensional model space X or its restriction F (0 ^) F to any 

- N -dimensional subspace A' (0>a >) of AT. The calculation of resolution in section 7 will be formally 
the same in either case. However, if a subspace A (0 ^) can be found which permits tight bounds 
on the truncation errors (6.7) and also satisfies N «£> , then section 7 should be applied to F (0 ^) 
rather than F. Truncation will produce only a small loss of accuracy, and will permit the very 
large computational economy consequent on manipulating N xN rather than D xD matrices. 
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Table 1 


p 

v(p) 

10- 2 

2.58 

KT 3 

3.29 

KT 4 

3.89 

io- 5 

4.42 

KT 6 

4.89 

IO" 7 

5.33 


Half-length v(p ) of the symmetric confidence interval with failure rate p for a one- 
dimensional Gaussian with mean 0 and variance 1. 


Table 2 


/ 

</?/"(* ) 2 >m 

\Kfa(P)\ 

\K Z V] (P)\ 

1 

107.77 

.07 

.009 

2 

22.59 

.11 

.015 

3 

23.01 

.19 

.024 

4 

17.86 

.32 

.042 

5 

11.96 

.57 

.075 

6 

9.36 

1.03 

.135 

7 

7.93 

1.87 

.25 

8 

4.73 

3.42 

.45 

9 

6.78 

6.30 

.83 

10 

4.59 

11.67 

1.53 

11 

4.34 

21.69 

2.85 

12 

3.62 

40.5 

5.32 


Confidence sets for the Gauss coefficients when the quadratic bound is the heat flow bound. 
Column 2 gives the value in microTeslas of the rms of the Gauss coefficients j5/"(o ) of degree / 
at the CMB, calculated from Langel and Estes, 1982. Columns 3 and 4 give the half-lengths of 
the confidence intervals for {3 ["(a ) if the crustal error is systematic, or random and uncorrelated. 
The failure rate p is 1CT 4 . Using p- 10" 2 would leave column 3 unaffected and would decrease 
the entries in column 4 by about ten percent. 


